Distributed TensorFlow - tensor content changes while sending - tensorflow

I'm writing a contrib extension to distributed TensorFlow, overriding Rendezvous::RecvFromRemoteAsync(). To validate my solution, I added tensor checksums at various points in the code (sender & receiver). Strangely, I see that the checksum changes while I'm still inside the send code.
So, to simplify the check, I created the following function (pseudocode):
void TestChecksum(const Tensor& t, int delay_usec) {
  int64 checksum1 = checksum(t);  // checksum() is my own helper
  usleep(delay_usec);
  int64 checksum2 = checksum(t);
  CHECK(checksum1 == checksum2);
}
Now I'm calling this function at the start of the RecvLocalAsync() callback, in the original gRPC code (right here).
For a delay of 100000 microseconds, the test passes.
For a delay of 200000 microseconds, the test fails.
I also checked the tensor's buffer and saw that it is shared across all step ids. So it seems that the tensor content is being changed by another thread while RecvFromRemoteAsync() is still in progress. Is that possible? How can I be sure that I receive the correct tensor?
EDIT - How to reproduce:
1. Take this branch. If you prefer, the error-reproducing code is in the last commit and it can probably be cherry-picked with no conflicts.
2. Get the TensorFlow benchmarks repository.
3. Run tf_cnn_benchmarks.py with at least 1 ps and 2 workers.
The commands I used:
python -u tf_cnn_benchmarks.py --job_name=ps --task_index=0 --ps_hosts=<...> --worker_hosts=<...> --server_protocol=grpc --model=resnet152 --batch_size=32 --num_gpus=2 --local_parameter_device=gpu
python -u tf_cnn_benchmarks.py --job_name=worker --task_index=0 --ps_hosts=<...> --worker_hosts=<...> --server_protocol=grpc --model=resnet152 --batch_size=32 --num_gpus=2 --local_parameter_device=gpu
python -u tf_cnn_benchmarks.py --job_name=worker --task_index=1 --ps_hosts=<...> --worker_hosts=<...> --server_protocol=grpc --model=resnet152 --batch_size=32 --num_gpus=2 --local_parameter_device=gpu

Related

yolov7,no mask with the output image

I cloned yolov7 (https://github.com/WongKinYiu) with yolov7.pt and tried to run detect.py (I just want to run the example). It seems to run normally, but the output image has no mask. Why?
Here is my command and log:
(PyTorch) E:\yolov7>python detect.py --weights yolov7.pt --source inference\images\bus.jpg
Namespace(weights=['yolov7.pt'], source='inference\\images\\bus.jpg', img_size=640, conf_thres=0.25, iou_thres=0.45, device='', view_img=False, save_txt=False, save_conf=False, nosave=False, classes=None, agnostic_nms=False, augment=False, update=False, project='runs/detect', name='exp', exist_ok=False, no_trace=False)
YOLOR v0.1-103-g6ded32c torch 1.11.0 CUDA:0 (NVIDIA GeForce GTX 1650, 4095.6875MB)
Fusing layers...
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
Model Summary: 306 layers, 36905341 parameters, 6652669 gradients
Convert model to Traced-model...
traced_script_module saved!
model is traced!
E:\anaconda\envs\PyTorch\lib\site-packages\torch\functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Done. (151.6ms) Inference, (9.3ms) NMS
The image with the result is saved in: runs\detect\exp4\bus.jpg
Done. (3.713s)
And here is my result: output image
You set the argument classes=None.
The classes argument takes a list of class indices, i.e. the indices of the entities stored in the weights you are using for inference.
From detect.py:
parser.add_argument('--classes', nargs='+', type=int, help='filter by class: --class 0, or --class 0 2 3')
Since you told the model to check for zero classes, the model itself will not report anything.
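For example, to explicitly filter for class index 0 (person in the default COCO-trained weights), the command from the question would become something like this (hypothetical invocation, based on the flag definition above):
python detect.py --weights yolov7.pt --source inference\images\bus.jpg --classes 0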
I was also facing this issue. After downgrading my CUDA version to 10.2, the problem was solved. I used CUDA 10.2 with PyTorch 1.10.0 via pip installation. I hope it helps you too.
pip3 install torch==1.10.0+cu102 torchvision==0.11.1+cu102 torchaudio===0.10.0+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html
Source of the answer: link
When you are working with a GPU, half precision is enabled by default; you can change this by editing your detect.py file.
Go to the detect.py file and, around line 31 (not exactly sure), you will see this line of code:
half = device.type != 'cpu' # half precision only supported on CUDA
Replace that line with
half = False
and then save the detect.py file.
Now, when running your detection command, make sure to use --device 0, which indicates that your GPU should be used for detection.
python detect.py --weights yolov7.pt --device 0 --source inference\images\bus.jpg

Testing a Jupyter Notebook

I am trying to come up with a method to test a number of Jupyter notebooks. A test should run when a new notebook is implemented in a GitHub branch and submitted for a pull request. The tests are not that complicated; they mostly just check that the notebook runs end-to-end without any errors, plus maybe a few asserts. However:
There are certain calls in some cells that need to be mocked, e.g. a call to download the data from a database.
There may be some magic cells in the notebooks which run a pip command or something else.
I am open to using any testing library, such as pytest or unittest, although pytest is preferred.
I looked at a few libraries for testing notebooks, such as nbmake, treon, and testbook, but I was unable to make them work. I also tried to convert the notebook to a Python file, but the magic cells were converted to a get_ipython().run_cell_magic(...) call, which became an issue since pytest uses Python, not IPython, and get_ipython() is only available in IPython.
So, I am wondering what is a good way to test jupyter notebooks with all of that in mind. Any help is appreciated.
One straightforward approach I've already used is to execute the entire notebook with nbconvert.
A notebook failed.ipynb that raises an exception will result in a failed run, thanks to the --execute option, which tells nbconvert to execute the notebook prior to its conversion.
jupyter nbconvert --to notebook --execute failed.ipynb
# ...
# Exception: FAILED
echo $?
# 1
Another correct notebook passed.ipynb will result in a successful export.
jupyter nbconvert --to notebook --execute passed.ipynb
# [NbConvertApp] Converting notebook passed.ipynb to notebook
# [NbConvertApp] Writing 1172 bytes to passed.nbconvert.ipynb
echo $?
# 0
As the cherry on the cake, you can do the same through the API and wrap it in pytest!
import nbformat
import pytest
from nbconvert.preprocessors import ExecutePreprocessor
@pytest.mark.parametrize("notebook", ["passed.ipynb", "failed.ipynb"])
def test_notebook_exec(notebook):
    with open(notebook) as f:
        nb = nbformat.read(f, as_version=4)
        ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
        try:
            assert ep.preprocess(nb) is not None, f"Got empty notebook for {notebook}"
        except Exception:
            assert False, f"Failed executing {notebook}"
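The parametrize decorator runs test_notebook_exec once per listed notebook, passing each filename in as the notebook argument, so every notebook shows up as its own test case in the report.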
Running the test gives:
pytest test_nbconv.py
# FAILED test_nbconv.py::test_notebook_exec[failed.ipynb] - AssertionError: Failed executing failed.ipynb
# PASSED test_nbconv.py::test_notebook_exec[passed.ipynb]
Notes
There are several output formats; I've used notebook here.
This doesn't convert a notebook to a different format per se, instead it allows the running of nbconvert preprocessors on a notebook, and/or conversion to other notebook formats.
The Python code example is just a quick draft; it can be improved a lot.
Here is my own solution using testbook. Let's say I have a notebook called my_notebook.ipynb with the following content:
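(The notebook content was shown as an image in the original post; here is a minimal sketch of what it could look like, reconstructed from the test below. The BigQuery import path, the query string, and the cell layout are assumptions.)
# Cell 1
import pandas as pd
from google.cloud import bigquery
# Cell 2 - the call that needs to be mocked during tests
client = bigquery.Client()
dataframe = client.query("SELECT week, count FROM some_table").result().to_dataframe()
# Cell 3
x = 7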
The trick is to inject a cell before my call to bigquery.Client and mock it:
from testbook import testbook
@testbook('./my_notebook.ipynb')
def test_get_details(tb):
    tb.inject(
        """
        import mock
        mock_client = mock.MagicMock()
        mock_df = pd.DataFrame()
        mock_df['week'] = range(10)
        mock_df['count'] = 5
        p1 = mock.patch.object(bigquery, 'Client', return_value=mock_client)
        mock_client.query().result().to_dataframe.return_value = mock_df
        p1.start()
        """,
        before=2,
        run=False
    )
    tb.execute()
    dataframe = tb.get('dataframe')
    assert dataframe.shape == (10, 2)
    x = tb.get('x')
    assert x == 7
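As with the nbconvert example, this runs under pytest (for example, pytest test_my_notebook.py, assuming the test is saved in a file with that name).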

Some questions about grpc+gdr and grpc+verbs when using distributed TensorFlow

When I use distributed TensorFlow, grpc+gdr performs worse than grpc+verbs, even though nv_peer_mem is loaded. I don't know the difference between grpc+verbs and grpc+gdr. Can anyone help me?
Some output is below:
root@s36-2288H-V5:~# /etc/init.d/nv_peer_mem status
nv_peer_mem module is loaded.
My launch command is below:
python /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --server_protocol=grpc+verbs \
    --model=vgg16 --variable_update=parameter_server \
    --batch_size=64 --num_batches=50 --num_warmup_batches=10 \
    --local_parameter_device=gpu --num_gpus=1 \
    --job_name=ps --task_index=0 \
    --ps_hosts=172.168.30.25:10011 \
    --worker_hosts=172.168.30.26:50012 &
And when I set --server_protocol=grpc+gdr, the performance is worse.

How to periodically evaluate the performance of models in TF-Slim?

I am trying to use DenseNet for a regression problem with TF-Slim. My data contains 60000 JPEG images with 37 float labels per image. I divided my data into three different tfrecords files: a train set (60%), a validation set (20%), and a test set (20%).
I need to evaluate the validation set during the training loop and produce a plot like the one in the image.
The TF-Slim documentation only explains the training loop and the evaluation loop separately, so I can only evaluate the validation or test set after the training loop has finished, whereas, as I said, I need to evaluate during training.
I tried to use the slim.evaluation.evaluation_loop function instead of slim.evaluation.evaluate_once, but it doesn't help.
slim.evaluation.evaluation_loop(
    master=FLAGS.master,
    checkpoint_dir=checkpoint_path,
    logdir=FLAGS.eval_dir,
    num_evals=num_batches,
    eval_op=list(names_to_updates.values()) + print_ops,
    variables_to_restore=variables_to_restore,
    summary_op=tf.summary.merge(summary_ops),
    eval_interval_secs=eval_interval_secs)
I tried evaluation.evaluate_repeatedly as well.
from tensorflow.contrib.training.python.training import evaluation
evaluation.evaluate_repeatedly(
    master=FLAGS.master,
    checkpoint_dir=checkpoint_path,
    eval_ops=list(names_to_updates.values()) + print_ops,
    eval_interval_secs=eval_interval_secs)
Both of these functions just read the latest available checkpoint from checkpoint_dir and then apparently wait for the next one; however, when new checkpoints are generated, they don't run at all.
I use Python 2.7.13 and Tensorflow 1.3.0 on CPU.
Any help will be highly appreciated.
Using evaluate_once works just fine with a bash script that uses sleep. It appears that TensorBoard is capable of plotting multiple single runs from a given eval_dir...
So I use something like:
#!/bin/bash
set -e
# Paths to model and evaluation results
TRAIN_DIR=~/pDL/tensorflow/model/mobilenet_v1_1_224_rp-v1/run0004
TEST_DIR=${TRAIN_DIR}/eval
# Where the dataset is saved to.
DATASET_DIR=/mnt/data/tensorflow/data
# Run evaluation (using slim.evaluation.evaluate_once)
CONTINUE=1
while [ "$CONTINUE" -ne 0 ]
do
python eval_image_classifier.py \
--checkpoint_path=${TRAIN_DIR} \
--eval_dir=${TEST_DIR} \
--dataset_name=master_db \
--preprocessing_name=preprocess224 \
--dataset_split_name=valid \
--dataset_dir=${DATASET_DIR} \
--model_name=mobilenet_v1 \
--patch_size=64
echo "sleeping for next run"
sleep 600
done
This appears to be an issue with setting checkpoint_path properly, as addressed here:
https://github.com/tensorflow/tensorflow/issues/13769
where the answer by Ellie68 is to set:
if tf.gfile.IsDirectory(FLAGS.checkpoint_path):
    if tf.train.latest_checkpoint(FLAGS.checkpoint_path):
        checkpoint_path = tf.train.latest_checkpoint(FLAGS.checkpoint_path)
else:
    checkpoint_path = FLAGS.checkpoint_path
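In other words, when checkpoint_path points at a directory, it is resolved to the latest checkpoint inside that directory on every evaluation run, so the script above always picks up the newest checkpoint written by the training job rather than a fixed file.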

Running TensorFlow in a Jupyter notebook instead of via terminal commands

I wish to run some TensorFlow code in a Jupyter notebook.
If I run it in the terminal, then the link above gives instructions like this:
python src/validate_on_lfw.py ~/datasets/lfw/lfw_mtcnnpy_160 ~/models/facenet/20170512-110547
Question: how do I run it in a Jupyter notebook? Thanks.
e.g.,
# Load the model
facenet.load_model(args.model)
Simply replacing args.model with ~/models/facenet/20170512-110547
# Load the model
facenet.load_model('~/models/facenet/20170512-110547')
will give the error:
usage: ipykernel_launcher.py [-h] [--lfw_batch_size LFW_BATCH_SIZE]
[--image_size IMAGE_SIZE] [--lfw_pairs LFW_PAIRS]
[--lfw_file_ext {jpg,png}]
[--lfw_nrof_folds LFW_NROF_FOLDS]
lfw_dir model
ipykernel_launcher.py: error: too few arguments
sys.argv
Out[5]:
['/anaconda/envs/tensorflow/lib/python2.7/site-packages/ipykernel_launcher.py',
'-f',
'/Users/my_name/Library/Jupyter/runtime/kernel-770c12c9-8fbe-44f7-91dd-4b0a5c5d7537.json']
Ok, simple solution...
Simply run it in the terminal as the GitHub repo suggests, and in the meantime print out sys.argv in the terminal, like this:
sys.argv = ['src/validate_on_lfw.py', '/Users/../datasets/lfw/lfw_mtcnnpy_160', '/Users/../models/facenet/20170512-110547']
Then use these values of sys.argv in the Jupyter notebook as the default values for def parse_arguments(argv), and it worked.
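For example, a minimal sketch of that approach (the paths are placeholders for your own; it assumes src/validate_on_lfw.py exposes parse_arguments() and main() as in the facenet repo and that src/ is importable from the notebook):
import sys
# Example paths - replace with your own; these mirror the values printed in the terminal.
sys.argv = ['src/validate_on_lfw.py',
            '/Users/my_name/datasets/lfw/lfw_mtcnnpy_160',
            '/Users/my_name/models/facenet/20170512-110547']
sys.path.append('src')  # so validate_on_lfw can be imported from the notebook
import validate_on_lfw
# Build the same argparse namespace the script would get on the command line, then run it.
args = validate_on_lfw.parse_arguments(sys.argv[1:])
validate_on_lfw.main(args)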