Synchronous distributed tensorflow training runs asynchronously

Synchronous distributed tensorflow training runs asynchronously - tensorflow

System Information:
Debian 4.5.5
TF installed from binary (pip3 install tensorflow-gpu==1.0.1 --user)
TF version: v1.0.0-65-g4763edf-dirty 1.0.1
Bazel version: N.A.
CUDA 8.0 cuDNN v5.1
Steps to reproduce
Make a directory and download the following files into it:
training.py run.sh
Run the command ./run.sh to simply reproduce this issue.
Detailed descriptions for the bug
Recently, I tried to deploy the synchronous distributed tensorflow training on the cluster. I followed the tutorial and the inception example to write my own program. The training.py is from other user's implementation, which follows the same API usage as the official example. I modified it to enable it running on a single machine with multiple GPUs by making them communicate through localhost and mapping each worker to see only one GPU.
The run.sh launched three processes. One of them is the parameter server and the others are two workers implemented by between-graph replication. I created the training supervisor by tf.train.Supervisor() to manage multiple sessions in the distributed training for the initialization and synchronization.
I expect these two workers would synchronize each batch and work in the same epoch. However, the worker 0, which is launched prior to the worker 1, completed the whole training set without waiting for the worker 1. After that, the process of the worker 0 finished training process and exited normally while worker 1 behaved like falling into the deadlock and keep near 0% utilization of CPU and GPU for several hours.
Based on my observation, I suspect these two workers didn't communicate and synchronize at all for the data they passed. I report this problem as a bug because I create the optimizer tf.train.SyncReplicasOptimizer as suggested by the official website and the inception example. However, it seems that the synchronization behaviors, if any, are very strange and the program can not exit normally.
Source code / logs
Two files:
training.py: This file contains the source code for the parameter server and workers created to use synchronous distributed optimizers (tf.train.SyncReplicasOptimizer).
run.sh: This file launched the parameter server and the workers.
Log:
Please produce according to the steps and look at worker_0_log and worker_1_log

Related

testcontainers-python hanging while showing "waiting to be ready...", then fails

I'm running my unit testing code for neo4j.
My environment:
Ubuntu 20.04LTS server
1Gb Memory
1CPU
Here is what is displayed in the console:
====================================== test session starts ======================================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: ~/morsvq, configfile: pytest.ini
plugins: mock-3.8.2
collected 2 items
---------------------------------------- live log setup -----------------------------------------
INFO testcontainers.core.container:container.py:52 Pulling image neo4j:latest
INFO testcontainers.core.container:container.py:63 Container started: ad7963ed01
INFO testcontainers.core.waiting_utils:waiting_utils.py:46 Waiting to be ready...
INFO testcontainers.core.waiting_utils:waiting_utils.py:46 Waiting to be ready...
ERROR neo4j:__init__.py:571 Failed to read from defunct connection IPv4Address(('localhost', 49153)) (IPv4Address(('127.0.0.1', 49153)))
The same code runs successfully on a faster virtual machine with 8Gb Memory. So the code itself shouldn't be faulty. My suspision is that there is something to do with my configuration, so that it now consumes to much memory?
I've checked the official websites' documentation, but it doesn't mention the memory problem. I wonder if someone has encountered similar problem? How to fix this?

Disclaimer: I am a maintainer of tc-java, so I have only some basic experience with tc-python. However, some facts and constraints are universal across Testcontainers language implementations.
As you already wrote, the code runs fine on a more powerful machine, while it fails on an extremely limited machine. 1GB of RAM is not much, I would expect it is generally not enough to successfully start a Neo4j Docker container without memory swapping. Swapping would make the startup and interactions very slow, hence the startup timeout triggers.
For further debugging, you can run the Neo4j container directly using Docker CLI on your environment and see how it behaves.

Get mapping between /dev/nvidia* and nvidia-smi gpu list

A server with 4 GPUs is used for deep learning.
It often happens that GPU memory is not freed after the training process was terminated (killed). Results shown by nvidia-smi is
Nvidia-smi results
The cuda device 2 is used. (Might be a process launched with CUDA_VISIBLE_DEVICES=2)
Some sub-processes are still alive thus occupy the memory.
One bruce-force solution is to kill all processes created by python using:
pkill -u user_name python
This should be helpful if there is only one process to be cleaned up.
Another solution proposed by pytorch official My GPU memory isn’t freed properly
One may find them via
ps -elf | grep python.
However, if multiple processes are launched and we only want to kill the ones that related to a certain GPU, we can group processes by the gpu index (nvidia0, nvidia1, ...) as:
fuser -v /dev/nvidia*
fuser -v results
As we can see, /dev/nvidia3 is used by some python threads. Thus /dev/nvidia3 corresponds to cuda device 2.
The problem is: I want to kill certain processes launched with setting of CUDA_VISIBLE_DEVICES=2, but I do not know the gpu index (/dev/nvidia0, /dev/nvidia1, ...).
How to find the mapping between CUDA_VISIBLE_DEVICES={0,1,2,3} and /dev/nvidia{0,1,2,3}.

If you set CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable, then the order should be consistent between CUDA and nvidia-smi.
There is also another option (if you are sure the limiting to a specific GPU is done via CUDA_VISIBLE_DEVICES env var). Every process' environment can be examined in /proc/${PID}/environ. The format is partially binary, but grepping through the output usually works (if you force grep to treat the file as text file). This might require root privileges.

Submit a Keras training job to Google cloud

I am trying to follow this tutorial:
https://medium.com/#natu.neeraj/training-a-keras-model-on-google-cloud-ml-cb831341c196
to upload and train a Keras model on Google Cloud Platform, but I can't get it to work.
Right now I have downloaded the package from GitHub, and I have created a cloud environment with AI-Platform and a bucket for storage.
I am uploading the files (with the suggested folder structure) to my Cloud Storage bucket (basically to the root of my storage), and then trying the following command in the cloud terminal:
gcloud ai-platform jobs submit training JOB1
--module-name=trainer.cnn_with_keras
--package-path=./trainer
--job-dir=gs://mykerasstorage
--region=europe-north1
--config=gs://mykerasstorage/trainer/cloudml-gpu.yaml
But I get errors, first the cloudml-gpu.yaml file can't be found, it says "no such folder or file", and trying to just remove it, I get errors because it says the --init--.py file is missing, but it isn't, even if it is empty (which it was when I downloaded from the tutorial GitHub). I am Guessing I haven't uploaded it the right way.
Any suggestions of how I should do this? There is really no info on this in the tutorial itself.
I have read in another guide that it is possible to let gcloud package and upload the job directly, but I am not sure how to do this or where to write the commands, in my terminal with gcloud command? Or in the Cloud Shell in the browser? And how do I define the path where my python files are located?
Should mention that I am working with Mac, and pretty new to using Keras and Python.

I was able to follow the tutorial you mentioned successfully, with some modifications along the way.
I will mention all the steps although you made it halfway as you mentioned.
First of all create a Cloud Storage Bucket for the job:
gsutil mb -l europe-north1 gs://keras-cloud-tutorial
To answer your question on where you should write these commands, depends on where you want to store the files that you will download from GitHub. In the tutorial you posted, the writer is using his own computer to run the commands and that's why he initializes the gcloud command with gcloud init. However, you can submit the job from the Cloud Shell too, if you download the needed files there.
The only files we need from the repository are the trainer folder and the setup.py file. So, if we put them in a folder named keras-cloud-tutorial we will have this file structure:
keras-cloud-tutorial/
├── setup.py
└── trainer
├── __init__.py
├── cloudml-gpu.yaml
└── cnn_with_keras.py
Now, a possible reason for the ImportError: No module named eager error is that you might have changed the runtimeVersion inside the cloudml-gpu.yaml file. As we can read here, eager was introduced in Tensorflow 1.5. If you have specified an earlier version, it is expected to experience this error. So the structure of cloudml-gpu.yaml should be like this:
trainingInput:
scaleTier: CUSTOM
# standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
masterType: standard_gpu
runtimeVersion: "1.5"
Note: "standard_gpu" is a legacy machine type.
Also, the setup.py file should look like this:
from setuptools import setup, find_packages
setup(name='trainer',
version='0.1',
packages=find_packages(),
description='Example on how to run keras on gcloud ml-engine',
author='Username',
author_email='user#gmail.com',
install_requires=[
'keras==2.1.5',
'h5py'
],
zip_safe=False)
Attention: As you can see, I have specified that I want version 2.1.5 of keras. This is because if I don't do that, the latest version is used which has compatibility issues with versions of Tensorflow earlier than 2.0.
If everything is set, you can submit the job by running the following command inside the folder keras-cloud-tutorial:
gcloud ai-platform jobs submit training test_job --module-name=trainer.cnn_with_keras --package-path=./trainer --job-dir=gs://keras-cloud-tutorial --region=europe-west1 --config=trainer/cloudml-gpu.yaml
Note: I used gcloud ai-platform instead of gcloud ml-engine command although both will work. At some point in the future though, gcloud ml-engine will be deprecated.
Attention: Be careful when choosing the region in which the job will be submitted. Some regions do not support GPUs and will throw an error if chosen. For example, if in my command I set the region parameter to europe-north1 instead of europe-west1, I will receive the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED:
Quota failure for project . The request for 1 K80
accelerators exceeds the allowed maximum of 0 K80, 0 P100, 0 P4, 0 T4,
0 TPU_V2, 0 TPU_V3, 0 V100. To read more about Cloud ML Engine quota,
see https://cloud.google.com/ml-engine/quotas.
- '#type': type.googleapis.com/google.rpc.QuotaFailure violations:
- description: The request for 1 K80 accelerators exceeds the allowed maximum of
0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V3, 0 V100.
subject:
You can read more about the features of each region here and here.
EDIT:
After the completion of the training job, there should be 3 folders in the bucket that you specified: logs/, model/ and packages/. The model is saved on the model/ folder a an .h5 file. Have in mind that if you set a specific folder for the destination you should include the '/' at the end. For example, you should set gs://my-bucket/output/ instead of gs://mybucket/output. If you do the latter you will end up with folders output, outputlogs and outputmodel. Inside output there should be packages. The job page link should direct to output folder so make sure to check the rest of the bucket too!
In addition, in the AI-Platform job page you should be able to see information regarding CPU, GPU and Network utilization:
Also, I would like to clarify something as I saw that you posted some related questions as an answer:
Your local environment, either it is your personal Mac or the Cloud Shell has nothing to do with the actual training job. You don't need to install any specific package or framework locally. You just need to have the Google Cloud SDK installed (in Cloud Shell is of course already installed) to run the appropriate gcloud and gsutil commands. You can read more on how exactly training jobs on the AI-Platform work here.
I hope that you will find my answer helpful.

I got it to work halfway now by not uploading the files but just running the upload commands from cloud at my local terminal... however there was an error during it running ending in "job failed"
Seems it was trying to import something from the TensorFlow backend called "from tensorflow.python.eager import context" but there was an ImportError: No module named eager
I have tried "pip install tf-nightly" which was suggested at another place, but it says I don't have permission or I am loosing the connection to cloud shell(exactly when I try to run the command).
I have also tried making a virtual environment locally to match that on gcloud (with Conda), and have made an environment with Conda with Python=3.5, Tensorflow=1.14.0 and Keras=2.2.5, which should be supported for gcloud.
The python program works fine in this environment locally, but I still get the (ImportError: No module named eager) when trying to run the job on gcloud.
I am putting the flag --python-version 3.5 when submitting the job, but when I write the command "Python -V" in the google cloud shell, it says Python=2.7. Could this be the issue? I have not fins a way to update the python version with the cloud shell prompt, but it says google cloud should support python 3.5. If this is anyway the issue, any suggestions on how to upgrade python version on google cloud?
It is also possible to manually there a new job in the google cloud web interface, doing this, I get a different error message: ERROR: Could not find a version that satisfies the requirement cnn_with_keras.py (from versions: none) and No matching distribution found for cnn_with_keras.py. Where cnn_with_keras.py is my python code from the tutorial, which runs fine locally.
Really don't know what to do next. Any suggestions or tips would be very helpful!

The issue with the GPU is solved now, it was something so simple as, my google cloud account had GPU settings disabled and needed to be upgraded.

How to get stats for several gem5 full system userland benchmark programs under Linux without having to reboot multiple times?

I know how to modify my image, reboot, and re-run it, but that would make my experiments very slow, since boot takes a few minutes.
Is there a way to quickly switch:
command line options
the executable
that is being run after boot?
This is not trivial because the Linux kernel knows about:
the state of the root filesystem
the state of memory, and therefore of kernel CLI options that could be used to modify init
so I can't just switch those after a checkpoint.
This question is inspired from: https://www.mail-archive.com/gem5-users#gem5.org/msg16959.html

Here is a fully automated setup that can help you to do it.
The basic workflow is as follows:
run your benchmark from the init executable that gets passed to the Linux kernel
https://unix.stackexchange.com/questions/122717/how-to-create-a-custom-linux-distro-that-runs-just-one-program-and-nothing-else/238579#238579
https://unix.stackexchange.com/questions/174062/can-the-init-process-be-a-shell-script-in-linux/395375#395375
to run a single benchmark with different parameters without
rebooting, do in your init script:
m5 checkpoint
m5 readfile | sh
and set the contents of m5 readfile on a host filesystem file before restoring the checkpoint with:
echo 'm5 resetstats && ./run-benchmark && m5 dumpstats' > path/to/script
build/ARM/gem5.opt configs/example/fs.py --script path/to/script
This is what the "configs/boot/hack_back_ckpt.rcS" but I think that
script is overly complicated.
to modify the executable without having to reboot, attach a second
disk image and mount after the checkpoint is restored:
How to attach multiple disk images in a simulation with gem5 fs.py?
Another possibility would be 9P but it is not working currently: http://gem5.org/WA-gem5
to only count only benchmark instructions, do:
m5 resetstats
./run-benchmark
m5 dumpstats
If that is not precise enough, modify the source of your benchmark with m5ops magic instructions that do resetstats and dumpstats from within the benchmark as shown at: How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?
to make the first boot faster, you can boot with a simple CPU model like the default AtomicSimpleCPU and then switch to a more detailed, slower model after the checkpoint is restored: How to switch CPU models in gem5 after restoring a checkpoint and then observe the difference?

tensorflow inception for retraining / fine tuning with a pretrained model: inception_train

I tried to retrain (new images, new classes) on top of the pretrained inception model, I therefor followed the instructions of the inception readme:
https://github.com/tensorflow/models/tree/master/inception#how-to-construct-a-new-dataset-for-retraining
I successfully built and ran build_image_data using bazel, as described in the tutorial. Afterwards I successfully built inception_train using bazel:
~/tensorflowmodels/models/inception# bazel build inception/inception_train
INFO: Found 1 target...
Target //inception:inception_train up-to-date (nothing to build)
INFO: Elapsed time: 0.073s, Critical Path: 0.00s
However, running bazel-bin/inception/inception_train I always get the following:
~/tensorflowmodels/models/inception# bazel-bin/inception/inception_train --train_dir="/" --validation_dir="/" --data_dir="/images_jpg/" --pretrained_model_checkpoint_path="/tensorflowmodels/models/inception/inception-v3/" --fine_tune=True --initial_learning_rate=0.001 --input_queue_memory_factor=1 --num_gpus=1
-bash: bazel-bin/inception/inception_train: No such file or directory
Naturally I would say it's by 99.9999% chance a typo. So then I tried to run inception_train.py with python. I had to change some import locations, and it finally ran with the parameters. However the script stops without any error messages after the initialization of the CUDA drivers.
Any help on how to solve this (or perform fine tuning / retraining with inception) would be very much appreciated.
tensorflow version: 0.9rc0
CPU: Xeon 5, 24 cores
GPU: Grid K2 8 GB
OS: Ubuntu 14.04
BTW I posted this already as an Github issue (which was closed, since it would be more a case for Stack Overflow).

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas