Unable to import modules when using Weights and Biases' sweeps - numpy

I'm trying to improve my Keras neural network's hyperparameters by optimizing them with the Weights & Biases library (wandb).
Here is my configuration:
method: bayes
metric:
  goal: maximize
  name: Search elo
parameters:
  batch_number:
    distribution: int_uniform
    max: 100
    min: 1
  batch_size:
    distribution: int_uniform
    max: 1024
    min: 1
  epochs:
    distribution: int_uniform
    max: 10
    min: 1
  neural_net_blocks:
    distribution: int_uniform
    max: 5
    min: 1
  num_simulations:
    distribution: int_uniform
    max: 800
    min: 1
  pb_c_base:
    distribution: int_uniform
    max: 25000
    min: 15000
  pb_c_init:
    distribution: uniform
    max: 3
    min: 1
  root_dirichlet_alpha:
    distribution: uniform
    max: 4
    min: 0
  root_exploration_fraction:
    distribution: uniform
    max: 1
    min: 0
program: ../Main.py
However, when I run wandb agent arkleseisure/projectname/sweepcode, I get this error, repeated every time the agent launches a run:
2020-09-13 12:15:02,188 - wandb.wandb_agent - INFO - Running runs: ['klawqpqv']
2020-09-13 12:15:02,189 - wandb.wandb_agent - INFO - Cleaning up finished run: klawqpqv
2020-09-13 12:15:03,063 - wandb.wandb_agent - INFO - Agent received command: run
2020-09-13 12:15:03,063 - wandb.wandb_agent - INFO - Agent starting run with config:
batch_number: 75
batch_size: 380
epochs: 10
neural_net_blocks: 4
num_simulations: 301
pb_c_base: 17138
pb_c_init: 1.5509741790555416
root_dirichlet_alpha: 2.7032316257955133
root_exploration_fraction: 0.5768106739703028
2020-09-13 12:15:03,245 - wandb.wandb_agent - INFO - About to run command: python ../Main.py --batch_number=75 --batch_size=380 --epochs=10 --neural_net_blocks=4 --num_simulations=301 --pb_c_base=17138 --pb_c_init=1.5509741790555416 --root_dirichlet_alpha=2.7032316257955133 --root_exploration_fraction=0.5768106739703028
Traceback (most recent call last):
  File "../Main.py", line 3, in <module>
    import numpy
ModuleNotFoundError: No module named 'numpy'
The sweep crashes after three failed attempts, and I am wondering what I am doing wrong. Since W&B is made for machine learning projects, it must surely be possible to import numpy, so what can I change? My code before that point just imports other files from my project. When I run the script normally, it doesn't crash, but executes perfectly ordinarily.

The most likely problem is that wandb agent is running your script with a different Python interpreter than you intended.
The solution is to specify the Python interpreter by adding something like the following to the sweep configuration (where python3 is the interpreter you wish to use):
command:
  - ${env}
  - python3
  - ${program}
  - ${args}
This feature is documented at: https://docs.wandb.com/sweeps/configuration#command
And there is a FAQ for setting the python interpreter at:
https://docs.wandb.com/sweeps/faq#sweep-with-custom-commands
To understand a bit more about what is going on, look at the debug line you posted that says "About to run command:", followed by:
python ../Main.py --batch_number=75 --batch_size=380 --epochs=10 --neural_net_blocks=4 --num_simulations=301 --pb_c_base=17138 --pb_c_init=1.5509741790555416 --root_dirichlet_alpha=2.7032316257955133 --root_exploration_fraction=0.5768106739703028
By default, wandb agent runs your program with a Python interpreter named python. This allows users to customize their environment so that python points to their interpreter of choice, using pyenv, virtualenv, or similar tools.
If you typically run your programs with python2 or python3 on the command line, you can customize how the agent executes your program by specifying the command key in your configuration file as described above. Alternatively, if your program is executable and its first line specifies the interpreter with #!/usr/bin/env python3 shebang syntax, you can set your command array to:
command:
  - ${env}
  - ${program}
  - ${args}
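As a side note, the flags in the "About to run command" line have to be consumed by your script. Here is a minimal, hypothetical sketch of how ../Main.py might read them with argparse and hand them to wandb.init; only two of the swept parameters are shown, and the names simply mirror the sweep config above:
# Hypothetical sketch of the top of ../Main.py
import argparse
import wandb

parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int, default=32)
parser.add_argument('--epochs', type=int, default=1)
args, _ = parser.parse_known_args()  # tolerate flags not declared here

run = wandb.init(config=vars(args))  # swept values end up in run.config
print(run.config.batch_size, run.config.epochs)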


yolov7: no mask in the output image

I cloned yolov7 (https://github.com/WongKinYiu) with yolov7.pt and tried to run detect.py (I just want to run the example). It seems to run normally, but the output image has no mask. Why?
Here is my code and log:
(PyTorch) E:\yolov7>python detect.py --weights yolov7.pt --source inference\images\bus.jpg
Namespace(weights=['yolov7.pt'], source='inference\\images\\bus.jpg', img_size=640, conf_thres=0.25, iou_thres=0.45, device='', view_img=False, save_txt=False, save_conf=False, nosave=False, classes=None, agnostic_nms=False, augment=False, update=False, project='runs/detect', name='exp', exist_ok=False, no_trace=False)
YOLOR v0.1-103-g6ded32c torch 1.11.0 CUDA:0 (NVIDIA GeForce GTX 1650, 4095.6875MB)
Fusing layers...
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
Model Summary: 306 layers, 36905341 parameters, 6652669 gradients
Convert model to Traced-model...
traced_script_module saved!
model is traced!
E:\anaconda\envs\PyTorch\lib\site-packages\torch\functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Done. (151.6ms) Inference, (9.3ms) NMS
The image with the result is saved in: runs\detect\exp4\bus.jpg
Done. (3.713s)
and here is my result: [output image]
You set the argument classes=None.
The classes argument takes a list of class indices, selecting which of the classes stored in the weights you are referencing should be kept during inference.
From detect.py:
parser.add_argument('--classes', nargs='+', type=int, help='filter by class: --class 0, or --class 0 2 3')
Since you told the model to check for no classes, the model itself will not report anything.
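Following that logic, you would pass the class indices you want explicitly, for example (assuming index 0 is the person class in the COCO-trained weights):
python detect.py --weights yolov7.pt --classes 0 --source inference\images\bus.jpg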
I was also facing this issue. After downgrading the CUDA version to 10.2 my problem was solved. I used CUDA 10.2 with PyTorch 1.10.0 via a pip installation. I hope it helps you too.
pip3 install torch==1.10.0+cu102 torchvision==0.11.1+cu102 torchaudio===0.10.0+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html
Source of the answer: link
When you are working with a GPU, half precision is enabled by default, which you can change by editing your detect.py file.
Open detect.py; I am not exactly sure of the line number, but around line 31 you will see this line of code:
half = device.type != 'cpu' # half precision only supported on CUDA
Replace that line with
half = False
and then save the detect.py file.
Now when you run your detection command, make sure to pass --device 0, which indicates that your GPU must be utilized for detection.
python detect.py --weights yolov7.pt --device 0 --source inference\images\bus.jpg

Testing a Jupyter Notebook

I am trying to come up with a method to test a number of Jupyter notebooks. A test should run when a new notebook is added in a GitHub branch and submitted for a pull request. The tests are not that complicated: they mostly just check that the notebook runs end-to-end without any errors, plus maybe a few asserts. However:
There are certain calls in some cells that need to be mocked, e.g. a call to download the data from a database.
There may be some magic cells in the notebooks which run a pip command or something else.
I am open to using any testing library, such as pytest or unittest, although pytest is preferred.
I looked at a few libraries for testing notebooks, such as nbmake, treon, and testbook, but I was unable to make them work. I also tried to convert the notebook to a Python file, but the magic cells were converted to a get_ipython().run_cell_magic(...) call, which became an issue since pytest uses Python and not IPython, and get_ipython() is only available in IPython.
So, I am wondering what a good way to test Jupyter notebooks is with all of that in mind. Any help is appreciated.
One straightforward approach I've already used is to execute the entire notebook with nbconvert.
A notebook failed.ipynb that raises an exception will result in a failed run, thanks to the --execute option, which tells nbconvert to execute the notebook prior to its conversion.
jupyter nbconvert --to notebook --execute failed.ipynb
# ...
# Exception: FAILED
echo $?
# 1
Another correct notebook passed.ipynb will result in a successful export.
jupyter nbconvert --to notebook --execute passed.ipynb
# [NbConvertApp] Converting notebook passed.ipynb to notebook
# [NbConvertApp] Writing 1172 bytes to passed.nbconvert.ipynb
echo $?
# 0
Cherry on the cake, you can do the same through the API and so wrap it in Pytest!
import nbformat
import pytest
from nbconvert.preprocessors import ExecutePreprocessor

@pytest.mark.parametrize("notebook", ["passed.ipynb", "failed.ipynb"])
def test_notebook_exec(notebook):
    with open(notebook) as f:
        nb = nbformat.read(f, as_version=4)
        ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
        try:
            assert ep.preprocess(nb) is not None, f"Got empty notebook for {notebook}"
        except Exception:
            assert False, f"Failed executing {notebook}"
Running the test gives:
pytest test_nbconv.py
# FAILED test_nbconv.py::test_notebook_exec[failed.ipynb] - AssertionError: Failed executing failed.ipynb
# PASSED test_nbconv.py::test_notebook_exec[passed.ipynb]
Notes
There are several output formats; I've used notebook here.
This doesn’t convert a notebook to a different format per se, instead it allows the running of nbconvert preprocessors on a notebook, and/or conversion to other notebook formats.
The Python code example is just a quick draft; it can be largely improved.
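For instance, if you want to cover a whole folder of notebooks rather than a hard-coded list, one possible variation (just a sketch) is to collect them with pathlib and feed the list to parametrize:
import pathlib

import nbformat
import pytest
from nbconvert.preprocessors import ExecutePreprocessor

# Collect every notebook in the repository (adjust the glob pattern as needed).
NOTEBOOKS = [str(p) for p in pathlib.Path(".").glob("**/*.ipynb")]

@pytest.mark.parametrize("notebook", NOTEBOOKS)
def test_all_notebooks(notebook):
    nb = nbformat.read(notebook, as_version=4)
    # Raises (and thus fails the test) if any cell errors out.
    ExecutePreprocessor(timeout=600, kernel_name="python3").preprocess(nb)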
Here is my own solution using testbook. Let's say I have a notebook called my_notebook.ipynb that uses bigquery.Client to load query results into a dataframe and also defines a variable x.
The trick is to inject a cell before my call to bigquery.Client and mock it:
from testbook import testbook

@testbook('./my_notebook.ipynb')
def test_get_details(tb):
    tb.inject(
        """
        import mock
        mock_client = mock.MagicMock()
        mock_df = pd.DataFrame()
        mock_df['week'] = range(10)
        mock_df['count'] = 5
        p1 = mock.patch.object(bigquery, 'Client', return_value=mock_client)
        mock_client.query().result().to_dataframe.return_value = mock_df
        p1.start()
        """,
        before=2,
        run=False
    )
    tb.execute()
    dataframe = tb.get('dataframe')
    assert dataframe.shape == (10, 2)
    x = tb.get('x')
    assert x == 7

create_pretraining_data.py is writing 0 records to tf_examples.tfrecord while training custom BERT model

I am training a custom BERT model on my own corpus. I generated the vocab file using BertWordPieceTokenizer and am then running the code below:
!python create_pretraining_data.py \
  --input_file=/content/drive/My Drive/internet_archive_scifi_v3.txt \
  --output_file=/content/sample_data/tf_examples.tfrecord \
  --vocab_file=/content/sample_data/sifi_13sep-vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
The output I get is:
INFO:tensorflow:*** Reading from input files ***
INFO:tensorflow:*** Writing to output files ***
INFO:tensorflow: /content/sample_data/tf_examples.tfrecord
INFO:tensorflow:Wrote 0 total instances
I am not sure why I always get 0 instances in tf_examples.tfrecord; what am I doing wrong?
I am using TF version 1.12.
FYI, the generated vocab file is 290 KB.
It cannot read the input file; please use My\ Drive instead of My Drive:
--input_file=/content/drive/My\ Drive/internet_archive_scifi_v3.txt
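Equivalently, quoting the whole path should also work, since the problem is just the shell splitting the argument at the unescaped space:
--input_file='/content/drive/My Drive/internet_archive_scifi_v3.txt'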

Exception: Cannot start TensorBoard. in Google Cloud Datalab

I see that the TensorBoard process is running, and files are being written into the model directory. However, I repeatedly get the exception "Cannot start TensorBoard." I am using tf.estimator.
I am running my code on Google Cloud Datalab. I have tried changing the model directory and restarting the Datalab instance many times, and I have also tried killing all running TensorBoard processes. Nothing has worked so far. It was working earlier, and about once in every 10-15 attempts it still magically runs. What is happening?
This is how I am starting TensorBoard:
from google.datalab.ml import TensorBoard as tb
tb.start(model_dir)
This is how my Estimator is configured.
run_config = tf.estimator.RunConfig(
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tf_random_seed=FLAGS.tf_random_seed,
    model_dir=model_dir
)
estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   config=run_config)
Below are the files being written into the model directory by tf.estimator.
eval 8 minutes ago
checkpoint 124 B 9 minutes ago
events.out.tfevents.1559025239.78fe4cbf0fad 603 kB 9 minutes ago
graph.pbtxt 399 kB 12 minutes ago
model.ckpt-1.data-00000-of-00001 261 MB 11 minutes ago
model.ckpt-1.index 811 B 11 minutes ago
model.ckpt-1.meta 170 kB 11 minutes ago
model.ckpt-5.data-00000-of-00001 261 MB 9 minutes ago
model.ckpt-5.index 811 B 9 minutes ago
model.ckpt-5.meta 170 kB 9 minutes ago
The error I am getting is below. It is the same every time, and I have no further information to identify what is going wrong.
Exception                                 Traceback (most recent call last)
<ipython-input> in <module>()
      2 #tensorboard --logdir ./logs/1/train --host localhost --port 8081
      3 from google.datalab.ml import TensorBoard as tb
----> 4 tb.start(model_dir)

/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/ml/_tensorboard.py in start(logdir)
     77       retry -= 1
     78 
---> 79     raise Exception('Cannot start TensorBoard.')
     80 
     81   @staticmethod

Exception: Cannot start TensorBoard.
When I list the running TensorBoard processes using the code below, this is what I get:
x = tb.list() #Returns a dataframe
print(x)
logdir pid port
0 ./model_no_reuse/2 6236 40269
1 ./model_no_reuse/2 6241 57895
Please help me identify what is going wrong.
I tried increasing the VM configuration from 2 vCPUs/4.5 GB to 4 vCPUs/20 GB and the issue was resolved. It looks like even though the TensorBoard process does get started, it needs a certain minimum of resources before it will open up. I will change this answer if I arrive at any other conclusion.
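As a side note, the two stale processes listed above can be cleaned up before retrying. A minimal sketch, assuming the stop(pid) helper that google.datalab.ml.TensorBoard provides alongside list() and start():
from google.datalab.ml import TensorBoard as tb

# Stop every TensorBoard process Datalab knows about, then start a fresh one.
# tb.list() returns a dataframe with a 'pid' column, as shown above.
for pid in tb.list()['pid']:
    tb.stop(int(pid))
tb.start(model_dir)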

Distributed PyTorch code halts on multiple nodes when using MPI backend

I am trying to run PyTorch code on three nodes using OpenMPI, but the code just halts without any errors or output. Eventually my goal is to distribute a PyTorch graph across these nodes.
The three nodes are connected to the same LAN, have passwordless SSH access to each other, and have similar specifications:
Ubuntu 18.04
Cuda 10.0
OpenMPI built and installed from source
PyTorch built and installed from source
The code shown below works on a single node with multiple processes, launched as:
> mpirun -np 3 -H 192.168.100.101:3 python3 run.py
With the following output:
INIT 0 of 3 Init env://
INIT 1 of 3 Init env://
INIT 2 of 3 Init env://
RUN 0 of 3 with tensor([0., 0., 0.])
RUN 1 of 3 with tensor([0., 0., 0.])
RUN 2 of 3 with tensor([0., 0., 0.])
Rank 1 has data tensor(1.)
Rank 0 has data tensor(1.)
Rank 2 has data tensor(1.)
But when I place the code on the three nodes and run the following command on each node separately, it does nothing:
> mpirun -np 3 -H 192.168.100.101:1,192.168.100.102:1,192.168.100.103:1 python3 run.py
Can anyone suggest modifications to the code or the MPI configuration that would let this PyTorch code run on multiple nodes?
#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def run(rank, size):
    tensor = torch.zeros(size)
    print(f"RUN {rank} of {size} with {tensor}")
    # incrementing the old tensor
    tensor += 1
    # sending tensor to next rank
    if rank == size - 1:
        dist.send(tensor=tensor, dst=0)
    else:
        dist.send(tensor=tensor, dst=rank + 1)
    # receiving tensor from previous rank
    if rank == 0:
        dist.recv(tensor=tensor, src=size - 1)
    else:
        dist.recv(tensor=tensor, src=rank - 1)
    print('Rank ', rank, ' has data ', tensor[0])


def init_processes(rank, size, fn, backend, init):
    print(f"INIT {rank} of {size} Init {init}")
    dist.init_process_group(backend, init, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    os.environ['MASTER_ADDR'] = '192.168.100.101'
    os.environ['BACKEND'] = 'mpi'
    os.environ['INIT_METHOD'] = 'env://'
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    init_processes(world_rank, world_size, run, os.environ['BACKEND'], os.environ['INIT_METHOD'])
N.B. NCCL is not an option for me due to arm64-based hardware.
Apologies for replying late to this, but I was able to solve the issue by adding the --mca btl_tcp_if_include eth1 flag to the mpirun command.
The reason for the halt was that OpenMPI, by default, tries to locate and communicate with the other nodes over the local loopback network interface, e.g. lo. We have to explicitly specify which interface(s) should be included (or excluded) so it can locate the other nodes.
I hope it saves someone's day :)
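For reference, the full command from the question with that flag added would look like this (eth1 here stands for whatever interface connects your nodes to the LAN; substitute your own):
mpirun -np 3 -H 192.168.100.101:1,192.168.100.102:1,192.168.100.103:1 --mca btl_tcp_if_include eth1 python3 run.py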