I have some files that include image filepaths and features, and some of the images may be missing or corrupt. I'm wondering how to robustly handle errors, by skipping these images and removing them from the queue.
I notice that simply catching the error and continuing will cause the queue to output the same image, so it will repeatedly error out on the same image. Is there a way to dequeue the image on error?
Also, I have a 'tf.Print()' statement to log the filename, but the 'Result:' line in my log shows that the valid image was processed with no corresponding print output. Why does 'tf.Print()' only print the name of the nonexistent file, not the correctly processed file?
Below is a small example, with the same error-handling code as my larger program:
Code:
#!/usr/bin/python3
import tensorflow as tf
example_filename = 'example.csv'
max_iterations = 20
### Create the graph ###
filename_container_queue = tf.train.string_input_producer([ example_filename ])
filename_container_reader = tf.TextLineReader()
_, filename_container_contents = filename_container_reader.read(filename_container_queue)
image_filenames = tf.decode_csv(filename_container_contents, [ tf.constant('', shape=[1], dtype=tf.string) ])
# decode_jpeg only works on a single image at a time
image_filename_batch = tf.train.shuffle_batch([ image_filenames ], batch_size=1, capacity=100, min_after_dequeue=0)
image_filename = tf.reshape(image_filename_batch, [1])
image_filenames_queue = tf.train.string_input_producer(image_filename)
image_reader = tf.WholeFileReader()
_, image_contents = image_reader.read(image_filenames_queue)
image = tf.image.decode_jpeg(tf.Print(image_contents, [ image_filename ]), channels=3)
counter = tf.count_up_to(tf.Variable(tf.constant(0)), max_iterations)
result_op = tf.reduce_mean(tf.image.convert_image_dtype(image, tf.float32), [0,1]) # Output average Red, Green, Blue values.
init_op = tf.initialize_all_variables()
### Run the graph ###
print("Running graph")
with tf.Session() as sess:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
sess.run([ init_op ])
n = 0
try:
while not coord.should_stop():
try:
result, n = sess.run([ result_op, counter ])
print("Result:", result)
except tf.errors.NotFoundError as e:
print("Skipping file due to image not existing")
# coord.request_stop(e) <--- We only want to skip, not stop the entire process.
except tf.errors.OutOfRangeError as e:
print('Done training -- epoch limit reached after %d iterations' % n)
coord.request_stop(e)
finally:
coord.request_stop()
coord.join(threads)
Data:
example.csv contains:
/home/mburge/Pictures/junk/109798.jpg
nonexistent.jpg
Program Output:
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Running graph
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning N
UMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.8475
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 6.83GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
I tensorflow/core/kernels/logging_ops.cc:79] [nonexistent.jpg]
Result: [ 0.33875707 0.39879724 0.28882763]
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
W tensorflow/core/framework/op_kernel.cc:968] Not found: nonexistent.jpg
[[Node: ReaderRead_1 = ReaderRead[_class=["loc:#WholeFileReader", "loc:#input_producer_1"], _device="/job:localhost/replica:0/task:0/cpu:0"](WholeFileReader, input_produ
cer_1)]]
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Skipping file due to image not existing
Done training -- epoch limit reached after 0 iterations
You can manually define a dequeue op:
filename_deq = image_filenames_queue.dequeue()
and later, if you find a problem with reading a file, dequeue that file from the filename queue:
except tf.errors.NotFoundError as e:
print("Skipping file due to image not existing")
sess.run(filename_deq)
Related
I trained a custom MobileNetV2 SSD model for object detection. I saved the .pb file and now I want to convert it into a .tflite-file in order to use it with Coral edge-tpu.
I use Tensorflow 2.2 on Windows 10 on CPU.
The code I'm using:
import tensorflow as tf
saved_model_dir = r"C:/Tensorflow/Backup_Training/my_MobileNetV2_fpnlite_320x320/saved_model"
num_calibration_steps = 100
def representative_dataset_gen():
for _ in range(num_calibration_steps):
# Get sample input data as a numpy array
yield [np.random.uniform(0.0, 1.0, size=(1,416,416, 3)).astype(np.float32)]
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.experimental_new_converter = True
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [
#tf.lite.OpsSet.TFLITE_BUILTINS_INT8,
tf.lite.OpsSet.TFLITE_BUILTINS,
tf.lite.OpsSet.SELECT_TF_OPS
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_quant_model = converter.convert()
with open('model.tflite', 'wb') as f:
f.write(tflite_model)
I tried several proposed solutions from other threads and I also tried it with tf-nightly, tf2.3 & tf1.14 but none of it worked (there was always another error message I couldn't handle). Since I trained with tf2.2 I thought it might be a good idea to proceed with tf2.2.
Since I'm new to Tensorflow I have several questions: what exactly is the input tensor and where do I define it? Is there a possibility to see or extract this input tensor?
Does anybody know how to fix this issue?
The whole error message:
(tf22) C:\Tensorflow\Backup_Training>python full_int_quant.py
2020-10-22 14:51:20.460948: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-10-22 14:51:20.466366: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-10-22 14:51:29.231404: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2020-10-22 14:51:29.239003: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-10-22 14:51:29.250497: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: ip3536
2020-10-22 14:51:29.258432: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: ip3536
2020-10-22 14:51:29.269261: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-10-22 14:51:29.291457: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2ae2ac3ffc0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-22 14:51:29.298043: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-22 14:52:03.785341: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-10-22 14:52:03.790251: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-10-22 14:52:04.559832: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-10-22 14:52:04.564529: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] function_optimizer: Graph size after: 3672 nodes (3263), 5969 edges (5553), time = 136.265ms.
2020-10-22 14:52:04.570187: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] function_optimizer: function_optimizer did nothing. time = 2.637ms.
2020-10-22 14:52:10.742013: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-10-22 14:52:10.746868: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-10-22 14:52:12.358897: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-10-22 14:52:12.363657: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] constant_folding: Graph size after: 1714 nodes (-1958), 2661 edges (-3308), time = 900.347ms.
2020-10-22 14:52:12.369137: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799] constant_folding: Graph size after: 1714 nodes (0), 2661 edges (0), time = 60.628ms.
Traceback (most recent call last):
File "full_int_quant.py", line 40, in <module>
tflite_model = converter.convert()
File "C:\Users\schulzyk\Anaconda3\envs\tf22\lib\site-packages\tensorflow\lite\python\lite.py", line 480, in convert
raise ValueError(
ValueError: None is only supported in the 1st dimension. Tensor 'input_tensor' has invalid shape '[1, None, None, 3]'.
Whatever I change in the code, there is always the same error message. I don't know if this is a sign that during training something went wrong but there were no eye-catching occurences.
I'd be happy for any kind of feedback!
Ahh, object detection API with tensorflow 2.0 for coral is still a WIP. We are having many roadblocks and may not see this feature soon. I suggest using tf1.x aPI for now. Here is a good tutorial :)
https://github.com/Namburger/edgetpu-ssdlite-mobiledet-retrain
I followed the example from official tensorflow website.
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
Here is my spec
WSL
Ubuntu 16.04.6 LTS
Tensorflow2.0
No-GPU available
I have a file called 'tfexample.py' which looks like this
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf
import json, os
tfds.disable_progress_bar()
os.environ["TF_CONFIG"] = json.dumps(
{
"cluster": {"worker": ["localhost:12345", "localhost:23456"]},
"task": {"type": "worker", "index": 0},
}
)
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
BUFFER_SIZE = 10000
BATCH_SIZE = 64
def make_datasets_unbatched():
# Scaling MNIST data from (0, 255] to (0., 1.]
def scale(image, label):
image = tf.cast(image, tf.float32)
image /= 255
return image, label
datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)
return datasets["train"].map(scale).cache().shuffle(BUFFER_SIZE)
train_datasets = make_datasets_unbatched().batch(BATCH_SIZE)
def build_and_compile_cnn_model():
model = tf.keras.Sequential(
[
tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(64, activation="relu"),
tf.keras.layers.Dense(10, activation="softmax"),
]
)
model.compile(
loss=tf.keras.losses.sparse_categorical_crossentropy,
optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
metrics=["accuracy"],
)
return model
# single_worker_model = build_and_compile_cnn_model()
# single_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)
NUM_WORKERS = 2
# Here the batch size scales up by number of workers since
# `tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
# and now this becomes 128.
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS
with strategy.scope():
# Creation of dataset, and model building/compiling need to be within
# `strategy.scope()`.
train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)
multi_worker_model = build_and_compile_cnn_model()
# Keras' `model.fit()` trains the model with specified number of epochs and
# number of steps per epoch. Note that the numbers here are for demonstration
# purposes only and may not sufficiently produce a model with good quality.
multi_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)
When I run this file with
python tfexample.py
The terminal just hangs like below
2020-02-04 17:50:23.483411: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485194: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485747: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/danny/.local/lib/python2.7/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
2020-02-04 17:50:29.013263: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-04 17:50:29.014152: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-04 17:50:29.014781: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (WINDOWS-6DFFM0Q): /proc/driver/nvidia/version does not exist
2020-02-04 17:50:29.015780: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-04 17:50:29.025575: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2701000000 Hz
2020-02-04 17:50:29.027050: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x66b11a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-04 17:50:29.027669: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
E0204 17:50:29.038614800 24084 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"#1580856629.038575000","description":"Protocol not available","errno":92,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":175,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
E0204 17:50:29.039313500 24084 socket_utils_common_posix.cc:299] setsockopt(TCP_USER_TIMEOUT) Protocol not available
2020-02-04 17:50:29.051180: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12345, 1 -> localhost:23456}
2020-02-04 17:50:29.053392: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345
Any help will be appreciated!
Are you running the tfexample.py on two sessions with the correct TFconfig. I haven't tried two instances on the same machine
This problem is due MultiWorkerMirroredStrategy() needs as much different physical devices as number of workers you want to run. If you want to run your script in your local machine you can run each worker in a different Docker container.
I'm attempting to get this PyTorch person detection example:
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
running locally with a GPU, either in a Jupyter Notebook or a regular python file. I get the error in the title either way.
I'm using Ubuntu 18.04. Here is a summary of the steps I've performed:
1) Stock Ubuntu 18.04 install on a Lenovo ThinkPad X1 Extreme Gen 2 with a GTX 1650 GPU.
2) Perform a standard CUDA 10.0 / cuDNN 7.4 install. I'd rather not restate all the steps as this post is going to be more than long enough already. This is a standard procedure, pretty much any link found via googling is what I followed.
3) Install torch and torchvision
pip3 install torch torchvision
4) From this link on the PyTorch site:
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
I've both saved the linked notebook:
https://colab.research.google.com/github/pytorch/vision/blob/temp-tutorial/tutorials/torchvision_finetuning_instance_segmentation.ipynb
And Also tried the link at the bottom that has the regular Python file:
https://pytorch.org/tutorials/_static/tv-training-code.py
5) Before running either the notebook or the regular Python way, I did the following (found at the top of the above linked notebook):
Install the CoCo API into Python:
cd ~
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
open Makefile in gedit, change the two instances of "python" to "python3", then:
python3 setup.py build_ext --inplace
sudo python3 setup.py install
Get the necessary files the above linked files need to run:
cd ~
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.5.0
from ~/vision/references/detection, copy coco_eval.py, coco_utils.py, engine.py, transforms.py, and utils.py to whichever directory the above linked notebook or tv-training-code.py file are being ran from.
6) Download the Penn Fudan Pedestrian dataset from the link on the above page:
https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
then unzip and put in the same directory as the notebook or tv-training-code.py
In case the above link ever breaks or just for easier reference, here is tv-training-code.py as I have downloaded it at this time:
# Sample code from the TorchVision 0.3 Object Detection Finetuning Tutorial
# http://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
import os
import numpy as np
import torch
from PIL import Image
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from engine import train_one_epoch, evaluate
import utils
import transforms as T
class PennFudanDataset(object):
def __init__(self, root, transforms):
self.root = root
self.transforms = transforms
# load all image files, sorting them to
# ensure that they are aligned
self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))
def __getitem__(self, idx):
# load images ad masks
img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
img = Image.open(img_path).convert("RGB")
# note that we haven't converted the mask to RGB,
# because each color corresponds to a different instance
# with 0 being background
mask = Image.open(mask_path)
mask = np.array(mask)
# instances are encoded as different colors
obj_ids = np.unique(mask)
# first id is the background, so remove it
obj_ids = obj_ids[1:]
# split the color-encoded mask into a set
# of binary masks
masks = mask == obj_ids[:, None, None]
# get bounding box coordinates for each mask
num_objs = len(obj_ids)
boxes = []
for i in range(num_objs):
pos = np.where(masks[i])
xmin = np.min(pos[1])
xmax = np.max(pos[1])
ymin = np.min(pos[0])
ymax = np.max(pos[0])
boxes.append([xmin, ymin, xmax, ymax])
boxes = torch.as_tensor(boxes, dtype=torch.float32)
# there is only one class
labels = torch.ones((num_objs,), dtype=torch.int64)
masks = torch.as_tensor(masks, dtype=torch.uint8)
image_id = torch.tensor([idx])
area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
# suppose all instances are not crowd
iscrowd = torch.zeros((num_objs,), dtype=torch.int64)
target = {}
target["boxes"] = boxes
target["labels"] = labels
target["masks"] = masks
target["image_id"] = image_id
target["area"] = area
target["iscrowd"] = iscrowd
if self.transforms is not None:
img, target = self.transforms(img, target)
return img, target
def __len__(self):
return len(self.imgs)
def get_model_instance_segmentation(num_classes):
# load an instance segmentation model pre-trained pre-trained on COCO
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
# now get the number of input features for the mask classifier
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
# and replace the mask predictor with a new one
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
hidden_layer,
num_classes)
return model
def get_transform(train):
transforms = []
transforms.append(T.ToTensor())
if train:
transforms.append(T.RandomHorizontalFlip(0.5))
return T.Compose(transforms)
def main():
# train on the GPU or on the CPU, if a GPU is not available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# our dataset has two classes only - background and person
num_classes = 2
# use our dataset and defined transformations
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))
# split the dataset in train and test set
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])
# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
dataset, batch_size=2, shuffle=True, num_workers=4,
collate_fn=utils.collate_fn)
data_loader_test = torch.utils.data.DataLoader(
dataset_test, batch_size=1, shuffle=False, num_workers=4,
collate_fn=utils.collate_fn)
# get the model using our helper function
model = get_model_instance_segmentation(num_classes)
# move model to the right device
model.to(device)
# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
momentum=0.9, weight_decay=0.0005)
# and a learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
step_size=3,
gamma=0.1)
# let's train it for 10 epochs
num_epochs = 10
for epoch in range(num_epochs):
# train for one epoch, printing every 10 iterations
train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
# update the learning rate
lr_scheduler.step()
# evaluate on the test dataset
evaluate(model, data_loader_test, device=device)
print("That's it!")
if __name__ == "__main__":
main()
Here is an exmaple run of tv-training-code.py
$ python3 tv-training-code.py
Epoch: [0] [ 0/60] eta: 0:01:17 lr: 0.000090 loss: 4.1717 (4.1717) loss_classifier: 0.8903 (0.8903) loss_box_reg: 0.1379 (0.1379) loss_mask: 3.0632 (3.0632) loss_objectness: 0.0700 (0.0700) loss_rpn_box_reg: 0.0104 (0.0104) time: 1.2864 data: 0.1173 max mem: 1865
Traceback (most recent call last):
File "tv-training-code.py", line 165, in <module>
main()
File "tv-training-code.py", line 156, in main
train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
File "/xxx/PennFudanExample/engine.py", line 46, in train_one_epoch
losses.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/function.py", line 77, in apply
return self._forward_cls.backward(self, *args)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/function.py", line 189, in wrapper
outputs = fn(ctx, *args)
File "/usr/local/lib/python3.6/dist-packages/torchvision/ops/roi_align.py", line 38, in backward
output_size[0], output_size[1], bs, ch, h, w, sampling_ratio)
RuntimeError: CUDA out of memory. Tried to allocate 132.00 MiB (GPU 0; 3.81 GiB total capacity; 2.36 GiB already allocated; 132.69 MiB free; 310.59 MiB cached) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:267)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fdfb6c9b813 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1ce68 (0x7fdfb6edce68 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1de6e (0x7fdfb6edde6e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x279 (0x7fdf59472789 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[many more frame lines omitted]
Clearly the line:
RuntimeError: CUDA out of memory. Tried to allocate 132.00 MiB (GPU 0; 3.81 GiB total capacity; 2.36 GiB already allocated; 132.69 MiB free; 310.59 MiB cached) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:267)
is the critical error.
If I run an nvidia-smi before a run:
$ nvidia-smi
Tue Dec 24 14:32:49 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44 Driver Version: 440.44 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1650 Off | 00000000:01:00.0 On | N/A |
| N/A 47C P8 5W / N/A | 296MiB / 3903MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1190 G /usr/lib/xorg/Xorg 142MiB |
| 0 1830 G /usr/bin/gnome-shell 72MiB |
| 0 3711 G ...uest-channel-token=14371934934688572948 78MiB |
+-----------------------------------------------------------------------------+
It seems pretty clear there is plenty of GPU memory available (this GPU is 4GB).
Moreover, I'm confident my CUDA/cuDNN install and GPU hardware are good b/c I train and inference the TensorFlow object detection API on this computer frequently, and as long as I use the allow_growth option I never have GPU related errors.
From Googling on this error it seems to be relatively common. The most common solutions are:
1) Try a smaller batch size (not really applicable in this case since the training and testing batch sizes are 2 and 1 respectively, and I tried with 1 and 1 and still got the same error)
2) Update to the latest version of PyTorch (but I'm already at the latest version).
Some other suggestions involve reworking the training script. I'm very familiar with TensorFlow but I'm new to PyTorch so I'm not sure how to go about that. Also, most of the rework suggestions I can find for this error do not pertain to object detection and therefore I'm not able to relate them to this training script specifically.
Has anybody else gotten this script to run locally with an NVIDIA GPU? Do you suspect a OS/CUDA/PyTorch configuration concern, or is there someway the script can be reworked to prevent this error? Any assistance would be greatly appreciated.
Very strange, after changing both the training and testing batch size to 1, it now does not crash with a GPU error. Very strange since I'm certain I tried this before.
Perhaps it had something to do with changing the batch size to 1 for both training and testing, and then rebooting or somehow refreshing something else? I'm not really sure. Very odd.
Now the evaluate function call is crashing with the error:
object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
But it seems this is completely unrelated so I'll make a separate post for that.
I'm trying to draw mini-batches from cifar10 binary files.
When implementing my code shown below (see [source code]), the machine (python 3.6) keeps showing the message (see [console]) ans stops.
Does anyone can tell me what is the problem of my source code?
P.S. I'm new to tensorflow..
[source code]-------------------------------------------------------------
import tensorflow as tf
import numpy as np
import os
import matplotlib.pyplot as plt
def _get_image_and_label():
# directory where binary files are stored
data_dir = '/tmp/cifar10_data/cifar-10-batches-bin'
# Step1) make filename Queue
filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i) for i in range(1, 6)]
filename_queue = tf.train.string_input_producer(filenames)
# Step2) read files
label_bytes = 1 # 2 for CIFAR-100
height = 32
width = 32
depth = 3
image_bytes = height * width * depth
record_bytes = label_bytes + image_bytes
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
key, value = reader.read(filename_queue)
# Step3) decode the file in a unit of 1 byte
record_bytes = tf.decode_raw(value, tf.uint8)
# The first bytes represent the label, which we convert from uint8->int32.
label = tf.cast(tf.strided_slice(record_bytes, [0], [label_bytes]), tf.int32)
# The remaining bytes after the label represent the image, which we reshape from [depth * height * width] to [depth, height, width].
depth_major = tf.reshape(tf.strided_slice(record_bytes, [label_bytes], [label_bytes + image_bytes]),
[depth, height, width])
# Convert from [depth, height, width] to [height, width, depth].
uint8image = tf.transpose(depth_major, [1, 2, 0])
# set shape ( image: tf.float32, label: tf.int32 )
image = tf.cast(uint8image, tf.float32)
image.set_shape([height, width, 3])
label.set_shape([1])
# collect batch from the files
# train_x_batch, train_y_batch = tf.train.batch([image, label], batch_size=1)
# return train_x_batch, train_y_batch
return image, label
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
image, label = _get_image_and_label()
for i in range(10):
image_batch, lable_batch = tf.train.batch([image, label], batch_size=1)
image_batch_uint8 = tf.cast(image_batch, tf.uint8)
final_image = sess.run(image_batch_uint8)
imgplot = plt.imshow(final_image[0])
coord.request_stop()
coord.join(threads)
sess.close()
[Console]-----------------------------------------------------------------/home/dooseop/anaconda3/bin/python /home/dooseop/pycharm-community-2016.3.3/helpers/pydev/pydevd.py --multiproc --qt-support --client 127.0.0.1 --port 40623 --file /home/dooseop/PycharmProjects/Tensorflow/CIFAR10_main.py
warning: Debugger speedups using cython not found. Run '"/home/dooseop/anaconda3/bin/python" "/home/dooseop/pycharm-community-2016.3.3/helpers/pydev/setup_cython.py" build_ext --inplace' to build.
Connected to pydev debugger (build 163.15188.4)
pydev debugger: process 10992 is connecting
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.17GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
Sigkill only happens when someone explicitly kills the program. The problem here is that start_queue_runners is being called before the queue runners are created (as they are created by tf.train.batch). Also for better performance build the graph once and run it in a loop, as in:
image, label = _get_image_and_label()
image_batch, lable_batch = tf.train.batch([image, label], batch_size=1)
image_batch_uint8 = tf.cast(image_batch, tf.uint8)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
for i in range(10):
final_image = sess.run(image_batch_uint8)
imgplot = plt.imshow(final_image[0])
coord.request_stop()
coord.join(threads)
I've run a image processing script using tensorflow API. It turns out that the processing time decreased quickly when I set the for-loop outside the session running procedure. Could anyone tell me why? Is there any side-effects?
The original code:
with tf.Session() as sess:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
for i in range(len(file_list)):
start = time.time()
image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
print( 'Done image %d th in %d ms \n'% (i, ((time.time() - start)*1000)))
# image_crop, bboxs_crop, image_debug = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
# Image._show(Image.fromarray(np.asarray(image_crop)))
# Image._show(Image.fromarray(np.asarray(image_debug)))
save_image(image_crop, ntpath.basename(file_list[i]))
#save_desc_file(file_list[i], labels_list[i], bboxs_crop)
save_desc_file(file_list[i], labels, bboxs)
coord.request_stop()
coord.join(threads)
The code modified:
for i in range(len(file_list)):
with tf.Graph().as_default(), tf.Session() as sess:
start = time.time()
image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
print( 'Done image %d th in %d ms \n'% (i, ((time.time() - start)*1000)))
labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
save_image(image_crop, ntpath.basename(file_list[i]))
save_desc_file(file_list[i], labels, bboxs)
The time cost in the original code would keep increasing from 200ms to even 20000ms. While after modified, the the logs messages indicate that there are more than one graph and tensorflow devices were created during running, why is that?
python random_crop_images_hongyuan.py I
tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA
library libcublas.so.8.0 locally I
tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA
library libcudnn.so.5 locally I
tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA
library libcufft.so.8.0 locally I
tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA
library libcuda.so.1 locally I
tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA
library libcurand.so.8.0 locally W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use SSE3 instructions, but these are
available on your machine and could speed up CPU computations. W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use SSE4.1 instructions, but these are
available on your machine and could speed up CPU computations. W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use SSE4.2 instructions, but these are
available on your machine and could speed up CPU computations. W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use AVX instructions, but these are
available on your machine and could speed up CPU computations. W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use AVX2 instructions, but these are
available on your machine and could speed up CPU computations. W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use FMA instructions, but these are
available on your machine and could speed up CPU computations. I
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful
NUMA node read from SysFS had negative value (-1), but there must be
at least one NUMA node, so returning NUMA node zero I
tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0
with properties: name: GeForce GT 730M major: 3 minor: 5
memoryClockRate (GHz) 0.758 pciBusID 0000:01:00.0 Total memory:
982.88MiB Free memory: 592.44MiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 I
tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y I
tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci
bus id: 0000:01:00.0) Done image 3000 th in 317 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci
bus id: 0000:01:00.0) Done image 3001 th in 325 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci
bus id: 0000:01:00.0) Done image 3002 th in 312 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci
bus id: 0000:01:00.0) Done image 3003 th in 147 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci
bus id: 0000:01:00.0) Done image 3004 th in 447 ms
My guess is that this happens because creating the session is an expensive operation. May be it could also happen that the session is not properly cleaned when the with-statement is left, so each new allocation on the device will have less resources available. In short, I would not recommend doing it this way, rather initialize just one session and try to reuse it.
EDIT:
In answer to your comment: The session is closed automatically as soon as the with-block is exited. I've read in this github issue that the memory on the GPU is only really released when the whole program exits. But I guess that when you allocate a new session after you closed the last one, Tensorflow will internally just re-use the previously allocated resources. So, in retrospective my answer is probably not very insightful. Sorry if I caused confusion.
It's not possible to be 100% certain without seeing all of your code, but I would guess that the crop_image() function is calling various TensorFlow op functions to build a graph.
It is almost never a good idea to build a graph inside a for loop. This answer explains why: some operations (such as the first Session.run() call to a new operation) take time that is linear in the number of operations in the graph. If you add more operations in each iteration, iteration i will do work that is linear in i, and so the overall execution time will be quadratic.
The modified version of your code (with a with tf.Graph().as_default(): block inside the loop) will be faster because it creates a new, empty tf.Graph in each iteration, and therefore each iteration does a constant amount of work.
An even more efficient solution would be to build the graph and session once, using tf.placeholder() tensors to represent the filename and bbox arguments to crop_image, and feeding different values to these placeholders in each iteration.