GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=1)` - gpu

model_nbeats = NBEATSModel(
input_chunk_length=30,
output_chunk_length=7,
generic_architecture=True,
num_stacks=10,
num_blocks=1,
num_layers=4,
layer_widths=512,
n_epochs=100,
nr_epochs_val_period=1,
batch_size=800,
model_name="nbeats_run",
)
model_nbeats.fit(train, val_series=val, verbose=True)
Hi,
While running the code above, "PossibleUserWarning: GPU available but not used. Set accelerator and devices using Trainer(accelerator='gpu', devices=1)." is appearing in the output cell of Google Colab. How can I fix it?
I have tried darts.nbeats to fit training data in 100 epochs. I select GPU to accelerate the training process. However, the training was not accelerated and the afferomentioned possiblewarning was appeared.

Related

Trying to custom train MobilenetV2 with 40x40px images - wrong results after training

I need to classify small images in 4 different categories, +1 "background" for false detection.
While training the loss quickly drop to 0.7, but stay there even after 800k steps. In the end, the frozen graph seems to classify most images with the background label.
I'm probably missing something, I'll detail the steps I used below, and any feedback is welcomed.
I'm new to tf-slim, so it can be an obvious mistake, maybe too little samples ? I'm not looking for top accuracy, just something working for prototyping.
Source materials can be found there : https://www.dropbox.com/s/k55xoygdzb2efag/TilesDataset.zip?dl=0
I used tensorflow-gpu 1.15.3 on windows 10.
I created the dataset using :
python ./createTfRecords.py --tfrecord_filename=tilesV2_40 --dataset_dir=.\tilesV2\Tiles_40
I added a dataset provider in models-master\research\slim\datasets based on the flowers provider.
I modified the mobilnet_v2.py in models-master\research\slim\nets\mobilenet, changed num_classes=5 and mobilenet.default_image_size = 40
I trained the net with : python ./models-master/research/slim/train_image_classifier.py --model_name "mobilenet_v2" --learning_rate 0.045 --preprocessing_name "inception_v2" --label_smoothing 0.1 --moving_average_decay 0.9999 --batch_size 96 --learning_rate_decay_factor 0.98 --num_epochs_per_decay 2.5 --train_dir ./weight --dataset_name Tiles_40 --dataset_dir .\tilesV2\Tiles_40
When I try this python .\models-master\research\slim\eval_image_classifier.py --alsologtostderr --checkpoint_path ./weight/model.ckpt-XXX --dataset_dir ./tilesV2/Tiles_40 --dataset_name Tiles_40 --dataset_split_name validation --model_name mobilenet_v2 I get eval/Recall_5[1]eval/Accuracy[1]
I then export the graph with python .\models-master\research\slim\export_inference_graph.py --alsologtostderr --model_name mobilenet_v2 --image_size 40 --output_file .\export\output.pb --dataset_name Tiles_40
And freeze it with freeze_graph --input_graph .\export\output.pb --input_checkpoint .\weight\model.ckpt-XXX --input_binary true --output_graph .\export\frozen.pb --output_node_names MobilenetV2/Predictions/Reshape_1
I then try the net with images from the dataset with python .\label_image.py --graph .\export\frozen.pb --labels .\tilesV2\Tiles_40\labels.txt --image .\tilesV2\Tiles_40\photos\lac\1_1.png --input_layer input --output_layer MobilenetV2/Predictions/Reshape_1. This is where I get wrong classifications.,
like 0:background 0.92839915 2:lac 0.020171663 1:house 0.019106707 3:road 0.01677236 4:start 0.0155500565 for a "lac" image of the dataset
I tried changing the depth_multiplier, the learning rate, learning on a cpu, removing --preprocessing_name "inception_v2" from the learning command. I don't have any idea left...
Change your learning rate, maybe start from the usual choice of 3e-5.

GluonCV - Object detection, set mx.ctx to GPU, but still using all CPU cores

I’m running an object detection routine on a server.
I set the context to the GPU, and I'm loading the model, the parameters and the data on the GPU. The program is reading from a video file or from a rtsp stream, using OpenCV.
When using nvidia-smi, I see that the selected GPU usage is at 20%, which is reasonable. However, the object detection routine is still using 750-1200 % of the CPU (basically, all of the available cores of the server).
This is the code:
def main():
ctx = mx.gpu(3)
# -------------------------
# Load a pretrained model
# -------------------------
net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_coco', pretrained=True)
# Load the webcam handler
cap = cv2.VideoCapture("video/video_01.mp4")
count_frame = 0
while(True):
print(f"Frame: {count_frame}")
# Load frame from the camera
ret, frame = cap.read()
if (cv2.waitKey(25) & 0xFF == ord('q')) or (ret == False):
cv2.destroyAllWindows()
cap.release()
print("Done!!!")
break
# Image pre-processing
frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
if isinstance(frame_nd, mx.ndarray.ndarray.NDArray):
frame_nd.wait_to_read()
# Run frame through network
frame_nd = frame_nd.as_in_context(ctx)
class_IDs, scores, bounding_boxes = net(frame_nd)
if isinstance(class_IDs, mx.ndarray.ndarray.NDArray):
class_IDs.wait_to_read()
if isinstance(scores, mx.ndarray.ndarray.NDArray):
scores.wait_to_read()
if isinstance(bounding_boxes, mx.ndarray.ndarray.NDArray):
bounding_boxes.wait_to_read()
count_frame += 1
cv2.destroyAllWindows()
cap.release()
This is the output of nvidia-smi:
while this is the output of top:
The pre-processing operations are running on the CPU:
frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
but is it enough to justify such a high CPU usage? In case, can I run them on GPU as well?
EDIT: I modified and copied the whole code, in response to Olivier_Cruchant's comment (thanks!)
Your CPU is likely busy because of the pre-processing load and frequent back-and-forth from memory to GPU because inference seems to be running frame-by-frame
I would suggest to try the following:
Run a batched inference (send a batch of N frames to the network) to
increase GPU usage and reduce communication
Try using NVIDIA DALI to
better use GPU for data ingestion and pre-processing (DALI MXNet reference, DALI mp4 ingestion pytorch example)

Detailed composition of TensorFlow device names?

What are "suffixes" such as Allocators, Compute, and Tensors in device names such as the following:
Allocators
/job:localhost/replica:0/task:0/device:CPU:0 Compute
/job:localhost/replica:0/task:0/device:CPU:0 Tensors
Are these for processes (or threads) on a given CPU? But then what about the first case, where there is no CPU specified?
BTW, I've seen these examples reported in timeline.json (serialized from a Timeline).
It seems as if those suffixes are not native to TensorFlow, but artifacts created when it produces a trace in Chrome Trace Format with Timeline.generate_chrome_trace_format().
These are the relevant lines of source code:
self._chrome_trace.emit_pid(dev_stats.device + ' Compute', device_pid)
self._chrome_trace.emit_pid(dev_stats.device + ' Tensors', tensors_pid)
I believe one shows the timeline of ops running on your device and the other shows the timeline of tensors running on that device.

TensorFlow Norm (LRN) doesn't support GPU

I am running following code on Google Cloud ML using BASIC GPU (Tesla K80)
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10
LRN is taking the most amount of time and its running on CPU. I am wondering if following stats quoted in https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_train.py were obtained by running on CPU because I don't see thats the case.
System | Step Time (sec/batch) | Accuracy
1 Tesla K20m | 0.35-0.60 | ~86% at 60K steps (5 hours)
If I force it to run it with GPU it throws following error:
Cannot assign a device to node 'norm1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available. [[Node: norm1 = LRNT=DT_HALF, alpha=0.00011111111, beta=0.75, bias=1, depth_radius=4, _device="/device:GPU:0"]

Tensorflow: dynamically call GPUs with enough free memory

My desktop has two gpus which can run Tensorflow with specification /gpu:0 or /gpu:1. However, if I don't specify which gpu to run the code, Tensorflow will by default to call /gpu:0, as we all know.
Now I would like to setup the system such that it can assign gpu dynamically according to the free memory of each gpu. For example, if a script doesn't specify which gpu to run the code, the system first assigns /gpu:0 for it; then if another script runs now, it will check whether /gpu:0 has enough free memory. If yes, it will continue assign /gpu:0 to it, otherwise it will assign /gpu:1 to it. How can I achieve it?
Follow-ups:
I believe the question above may be related to the virtualization problem of GPU. That is to say, if I can virtualize multi-gpu in a desktop into one GPU, I can get what I want. So beside any setup methods for Tensorflow, any ideas about virtualization is also welcome.
TensorFlow generally assumes it's not sharing GPU with anyone, so I don't see a way of doing it from inside TensorFlow. However, you could do it from outside as follows -- shell script that calls nvidia-smi, parses out GPU k with more memory, then sets "CUDA_VISIBLE_DEVICES=k" and calls TensorFlow script
Inspired by:
How to set specific gpu in tensorflow?
def leave_gpu_with_most_free_ram():
try:
command = "nvidia-smi --query-gpu=memory.free --format=csv"
memory_free_info = _output_to_list(sp.check_output(command.split()))[1:]
memory_free_values = [int(x.split()[0]) for i, x in enumerate(memory_free_info)]
least_busy_idx = memory_free_values.index(max(memory_free_values))
# update CUDA variable
gpus =[least_busy_idx]
setting = ','.join(map(str, gpus))
os.environ["CUDA_VISIBLE_DEVICES"] = setting
print('Left next %d GPU(s) unmasked: [%s] (from %s available)'
% (leave_unmasked, setting, str(available_gpus)))
except FileNotFoundError as e:
print('"nvidia-smi" is probably not installed. GPUs are not masked')
print(e)
except sp.CalledProcessError as e:
print("Error on GPU masking:\n", e.output)
Add a call to this function before importing tensorflow