Tensorflow SavedModel file size increases with each save - tensorflow

I have TensorFlow r1.13 training code that saves a SavedModel periodically during a long training run (I am following this excellent article on the topic). I have noticed that each time the model is saved, the size increases. In fact, it seems to increase exactly linearly, with each file being a multiple of the initial file size. I wonder if TF is keeping a reference to all previously saved files and accumulating them into each later save. Below are the file sizes for several SavedModel files written in sequence as training progresses.
-rw-rw-r-- 1 ubuntu ubuntu 576962 Apr 15 23:56 ./model_accuracy_0.361/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 1116716 Apr 15 23:58 ./model_accuracy_0.539/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 1656470 Apr 16 00:11 ./model_accuracy_0.811/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 2196440 Apr 16 00:15 ./model_accuracy_0.819/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 2736794 Apr 16 00:17 ./model_accuracy_0.886/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 3277150 Apr 16 00:19 ./model_accuracy_0.908/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 3817530 Apr 16 00:21 ./model_accuracy_0.919/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 4357950 Apr 16 00:25 ./model_accuracy_0.930/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 4898492 Apr 16 00:27 ./model_accuracy_0.937/saved_model.pb
Is there a way to cull the previously saved versions, or at least prevent them from being accumulated in the first place? I will certainly only keep the last file, but it seems to be 10x larger than it should be.
Below is my code (largely copied from Silva):
# Create the TensorInfo protobuf objects that encapsulate the input/output tensors
tensor_info_input_data_1 = tf.saved_model.utils.build_tensor_info(gd.data_1)
tensor_info_input_data_2 = tf.saved_model.utils.build_tensor_info(gd.data_2)
tensor_info_input_keep   = tf.saved_model.utils.build_tensor_info(gd.keep)

# Output tensor info
tensor_info_output_pred = tf.saved_model.utils.build_tensor_info(gd.targ_pred_oneh)
tensor_info_output_soft = tf.saved_model.utils.build_tensor_info(gd.targ_pred_soft)

# Define the SignatureDef for this export
prediction_signature = \
    tf.saved_model.signature_def_utils.build_signature_def(
        inputs={
            'data_1': tensor_info_input_data_1,
            'data_2': tensor_info_input_data_2,
            'keep':   tensor_info_input_keep
        },
        outputs={
            'pred_orig': tensor_info_output_pred,
            'pred_soft': tensor_info_output_soft
        },
        method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME)

graph_entry_point_name = "my_model"  # The logical name for the model in TF Serving

try:
    builder = tf.saved_model.builder.SavedModelBuilder(saved_model_path)
    builder.add_meta_graph_and_variables(
        sess=sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={graph_entry_point_name: prediction_signature}
    )
    builder.save(as_text=False)
    if verbose:
        print(" SavedModel graph written successfully. ")
    success = True
except Exception as e:
    print(" WARNING::SavedModel write FAILED. ")
    traceback.print_tb(e.__traceback__)
    success = False

return success

@Hephaestus,
If you're constructing a SavedModelBuilder each time, then it'll add new save operations to the graph every time you save.
Instead, you can construct the SavedModelBuilder only once and just call builder.save repeatedly; that will not add new ops to the graph on each save call.
Alternatively, I think you can create your own tf.train.Saver and pass it to add_meta_graph_and_variables; then it shouldn't create any new operations.
A good debugging aid is calling tf.get_default_graph().finalize() once you're done building the graph, which will raise an exception rather than silently expanding the graph like this.
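For concreteness, here is a minimal sketch of those suggestions combined (create one Saver up front, pass it to the builder, and finalize the graph). The export_model wrapper and its placement are my own, and it assumes the saver argument of add_meta_graph_and_variables is available in your TF 1.13 build:

import tensorflow as tf

# ... build the model graph once (gd.data_1, gd.targ_pred_oneh, ...) ...

saver = tf.train.Saver()           # created once: save/restore ops are added to the graph only here
tf.get_default_graph().finalize()  # debugging aid: raises if anything later tries to grow the graph

def export_model(sess, saved_model_path, prediction_signature):
    # Write one SavedModel without adding any new ops to the graph.
    builder = tf.saved_model.builder.SavedModelBuilder(saved_model_path)
    builder.add_meta_graph_and_variables(
        sess=sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={"my_model": prediction_signature},
        saver=saver)               # reuse the existing saver instead of letting the builder create one
    builder.save(as_text=False)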
Hope this helps.

Set clear_extraneous_savers=True for the Saver's export_meta_graph call:
https://github.com/tensorflow/tensorflow/blob/b78d23cf92656db63bca1f2cbc9636c7caa387ca/tensorflow/python/saved_model/builder_impl.py#L382
meta_graph_def = saver.export_meta_graph(
    clear_devices=clear_devices, clear_extraneous_savers=True,
    strip_default_attrs=strip_default_attrs)


yolov7, no mask in the output image

I cloned yolov7 (https://github.com/WongKinYiu) with yolov7.pt and tried to run detect.py (I just want to run the example). It seems to run normally, but the output image has no mask. Why?
Here is my command and log:
(PyTorch) E:\yolov7>python detect.py --weights yolov7.pt --source inference\images\bus.jpg
Namespace(weights=['yolov7.pt'], source='inference\\images\\bus.jpg', img_size=640, conf_thres=0.25, iou_thres=0.45, device='', view_img=False, save_txt=False, save_conf=False, nosave=False, classes=None, agnostic_nms=False, augment=False, update=False, project='runs/detect', name='exp', exist_ok=False, no_trace=False)
YOLOR v0.1-103-g6ded32c torch 1.11.0 CUDA:0 (NVIDIA GeForce GTX 1650, 4095.6875MB)
Fusing layers...
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
Model Summary: 306 layers, 36905341 parameters, 6652669 gradients
Convert model to Traced-model...
traced_script_module saved!
model is traced!
E:\anaconda\envs\PyTorch\lib\site-packages\torch\functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Done. (151.6ms) Inference, (9.3ms) NMS
The image with the result is saved in: runs\detect\exp4\bus.jpg
Done. (3.713s)
And here is my result: output image
You set the argument classes=None.
The classes argument takes a list of class indices, i.e. the indices of the entities stored in the weights you are referencing for inference.
From detect.py:
parser.add_argument('--classes', nargs='+', type=int, help='filter by class: --class 0, or --class 0 2 3')
Since you told the model to check for zero classes, the model itself will not report anything.
I was also facing this issue. After downgrading the CUDA version to 10.2, my problem was solved. I used CUDA 10.2 with PyTorch 1.10.0 installed via pip. I hope it helps you too.
pip3 install torch==1.10.0+cu102 torchvision==0.11.1+cu102 torchaudio===0.10.0+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html
Source of the answer: link
When you are working with a GPU, half precision is enabled by default; you can change this by editing your detect.py file.
Go to the detect.py file and (not exactly sure, but around line 31) you will see this line of code:
half = device.type != 'cpu' # half precision only supported on CUDA
Replace that line with
half = False
and then save the detect.py file.
Now, when you run your detection command, make sure to use --device 0, which indicates that your GPU should be used for detection.
python detect.py --weights yolov7.pt --device 0 --source inference\images\bus.jpg

create_pretraining_data.py is writing 0 records to tf_examples.tfrecord while training custom BERT model

I am pretraining a custom BERT model on my own corpus. I generated the vocab file using BertWordPieceTokenizer and am then running the command below:
!python create_pretraining_data.py \
  --input_file=/content/drive/My Drive/internet_archive_scifi_v3.txt \
  --output_file=/content/sample_data/tf_examples.tfrecord \
  --vocab_file=/content/sample_data/sifi_13sep-vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
I am getting this output:
INFO:tensorflow:*** Reading from input files ***
INFO:tensorflow:*** Writing to output files ***
INFO:tensorflow: /content/sample_data/tf_examples.tfrecord
INFO:tensorflow:Wrote 0 total instances
Not sure why I am always getting 0 instances in tf_examples.tfrecord; what am I doing wrong?
I am using TF version 1.12.
FYI, the generated vocab file is 290 KB.
It cannot read the input file; please use My\ Drive instead of My Drive:
--input_file=/content/drive/My\ Drive/internet_archive_scifi_v3.txt
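If it still writes 0 instances after that, a quick sanity check (just an illustrative snippet, reusing the path from your command) is to confirm that Python can see the file at all:

import os

# The input path exactly as create_pretraining_data.py should receive it.
path = "/content/drive/My Drive/internet_archive_scifi_v3.txt"
print(os.path.exists(path), os.path.getsize(path) if os.path.exists(path) else "missing")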

TensorFlow Serving Error: 'StatelessIf has '_lower_using_switch_merge' attr set but it does not support lowering.'

When attempting to serve a new model coded using TensorFlow 2.0 with TensorFlow serving, I get the following error from my Docker container logs:
2019-09-03 08:56:24.984824: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/model_modeFact/1567500955
2019-09-03 08:56:24.989902: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-09-03 08:56:25.002593: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: fail. Took 17772 microseconds.
2019-09-03 08:56:25.002658: E tensorflow_serving/util/retrier.cc:37] Loading servable: {name: model_modeFact version: 1567500955} failed: Internal: Node {{node zero_fraction/total_zero/zero_count/else/_1/zero_fraction/cond}} of type StatelessIf has '_lower_using_switch_merge' attr set but it does not support lowering.
Using the saved_model_cli, the model works fine and can make predictions.
Initially I was getting this error: "TensorFlow Serving crossed columns strange error"
I found that this error might be fixed by swapping to tf-nightly-2.0-preview==2.0.0.dev20190819, but now I can't even get my model to be served.
The only changes I made to the code to compile my model in TF2 are:
# Added this line to disable eager execution, necessary for tf.placeholder
tf.compat.v1.disable_eager_execution()
# For every usage of tf.estimator...
tf.compat.v1.estimator
# For every usage of tf.placeholder...
tf.compat.v1.placeholder
As in the previous problem, the goal is to get a prediction output from my served model, similar to what I get with saved_model_cli. Something like this:
Result for output key all_class_ids:
[[0 1 2 3 4 5]]
Result for output key all_classes:
[[b'0' b'1' b'2' b'3' b'4' b'5']]
Result for output key class_ids:
[[2]]
Result for output key classes:
[[b'2']]
Result for output key logits:
[[ 0.11128154 -0.44881764 0.31520572 -0.08318427 -0.3479367 -0.08883157]]
Result for output key probabilities:
[[0.19719791 0.11263006 0.2418051 0.16234797 0.12458517 0.16143374]]
Most probably this happens because you are using a TF2 Docker image. Try the
tensorflow/serving:1.15.0-rc2
Docker image; I hope it fixes this problem.
Also try calling tf.compat.v1.disable_v2_behavior() when the app that saves your model starts up.
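For example, a minimal sketch of the top of the export script (the exact placement is my assumption; both calls exist under tf.compat.v1):

import tensorflow as tf

# Run these before any graph/estimator code executes, so the SavedModel is
# built and exported with TF1-style graph behavior.
tf.compat.v1.disable_v2_behavior()
tf.compat.v1.disable_eager_execution()

# ... build the tf.compat.v1.estimator model and export the SavedModel as before ...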

Exception: Cannot start TensorBoard. in Google Cloud Datalab

I see that a TensorBoard process is running and files are being written into the model directory. However, I repeatedly get the exception "Cannot start TensorBoard." I am using tf.estimator.
I am running my code on Google Cloud Datalab. I have tried changing the model directory and restarting the Datalab instance many times, and I have also tried killing all running TensorBoard processes. Nothing has worked so far. It was working earlier, and roughly once in every 10-15 attempts it magically runs. What's happening?
This is how I am starting TensorBoard:
from google.datalab.ml import TensorBoard as tb
tb.start(model_dir)
This is how my Estimator is configured.
run_config = tf.estimator.RunConfig(
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tf_random_seed=FLAGS.tf_random_seed,
    model_dir=model_dir
)
estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   config=run_config)
Below are the files being written into the model directory by tf.estimator.
eval 8 minutes ago
checkpoint 124 B 9 minutes ago
events.out.tfevents.1559025239.78fe4cbf0fad 603 kB 9 minutes ago
graph.pbtxt 399 kB 12 minutes ago
model.ckpt-1.data-00000-of-00001 261 MB 11 minutes ago
model.ckpt-1.index 811 B 11 minutes ago
model.ckpt-1.meta 170 kB 11 minutes ago
model.ckpt-5.data-00000-of-00001 261 MB 9 minutes ago
model.ckpt-5.index 811 B 9 minutes ago
model.ckpt-5.meta 170 kB 9 minutes ago
The error I am getting is below. It is the same every time, and I have no further information to identify what is going wrong.
Exception Traceback (most recent call last)
in ()
2 #tensorboard --logdir ./logs/1/train --host localhost --port 8081
3 from google.datalab.ml import TensorBoard as tb
----> 4 tb.start(model_dir)
/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/ml/_tensorboard.py in start(logdir)
77 retry -= 1
78
---> 79 raise Exception('Cannot start TensorBoard.')
80
81 @staticmethod
Exception: Cannot start TensorBoard.
When I list the running TensorBoard processes using the code below, this is what I get:
x = tb.list() #Returns a dataframe
print(x)
logdir pid port
0 ./model_no_reuse/2 6236 40269
1 ./model_no_reuse/2 6241 57895
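For reference, this is roughly how I kill the stale processes before retrying (assuming the Datalab helper's stop() takes a pid, matching the pid column that list() returns):

from google.datalab.ml import TensorBoard as tb

# Stop every TensorBoard process that tb.list() reports, then re-check.
for pid in tb.list()['pid']:
    tb.stop(int(pid))
print(tb.list())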
Please help me identify what is going wrong.
I tried increasing the VM configuration from 2 vCPU / 4.5 GB to 4 vCPU / 20 GB and the issue is resolved. It looks like even though the TensorBoard process does get started, it needs certain minimum resources to open up. I will change the answer if I arrive at any other conclusion.

Distributed PyTorch code halts on multiple nodes when using MPI backend

I am trying to run PyTorch code on three nodes using Open MPI, but the code just halts without any errors or output. Eventually my goal is to distribute a PyTorch graph across these nodes.
The three nodes are connected to the same LAN, have passwordless SSH access to each other, and have similar specifications:
Ubuntu 18.04
Cuda 10.0
OpenMPI built and installed from source
PyTorch built and installed from source
The code shown below works on a single node with multiple processes, as:
> mpirun -np 3 -H 192.168.100.101:3 python3 run.py
With following output:
INIT 0 of 3 Init env://
INIT 1 of 3 Init env://
INIT 2 of 3 Init env://
RUN 0 of 3 with tensor([0., 0., 0.])
RUN 1 of 3 with tensor([0., 0., 0.])
RUN 2 of 3 with tensor([0., 0., 0.])
Rank 1 has data tensor(1.)
Rank 0 has data tensor(1.)
Rank 2 has data tensor(1.)
But when I place the code on the three nodes and run the following command on each node separately, it does nothing:
> mpirun -np 3 -H 192.168.100.101:1,192.168.100.102:1,192.168.100.103:1 python3 run.py
Could you give me some idea of what modifications to the code or the MPI configuration are needed to run this PyTorch code on multiple nodes?
#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def run(rank, size):
    tensor = torch.zeros(size)
    print(f"RUN {rank} of {size} with {tensor}")
    # incrementing the old tensor
    tensor += 1
    # sending tensor to next rank
    if rank == size - 1:
        dist.send(tensor=tensor, dst=0)
    else:
        dist.send(tensor=tensor, dst=rank + 1)
    # receiving tensor from previous rank
    if rank == 0:
        dist.recv(tensor=tensor, src=size - 1)
    else:
        dist.recv(tensor=tensor, src=rank - 1)
    print('Rank ', rank, ' has data ', tensor[0])


def init_processes(rank, size, fn, backend, init):
    print(f"INIT {rank} of {size} Init {init}")
    dist.init_process_group(backend, init, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    os.environ['MASTER_ADDR'] = '192.168.100.101'
    os.environ['BACKEND'] = 'mpi'
    os.environ['INIT_METHOD'] = 'env://'
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    init_processes(world_rank, world_size, run,
                   os.environ['BACKEND'], os.environ['INIT_METHOD'])
N.B. NCCL is not an option for me due to arm64-based hardware.
Apologies for replying late to this, but I was able to solve the issue by adding the --mca btl_tcp_if_include eth1 flag to the mpirun command, e.g. mpirun --mca btl_tcp_if_include eth1 -np 3 -H 192.168.100.101:1,192.168.100.102:1,192.168.100.103:1 python3 run.py.
The reason for the halt was that Open MPI, by default, tries to locate and communicate with the other nodes over the local loopback network interface, e.g. lo. We have to explicitly specify which interface(s) should be included (or excluded) to reach the other nodes.
I hope this saves someone's day :)