I/O in PyTorch DataLoader with np.load extremely slow on SSD - numpy

I am trying to load a relatively large batch of float16 multispectral images (BxCxHxW=800x12x256x256) to train a deep learning model. The code for the DataLoader is extremely simple:
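import os

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

data_dir = "/home/bla/data"
# os.listdir returns bare file names, so join them with the directory
paths = [os.path.join(data_dir, name) for name in os.listdir(data_dir)]

class MultiSpectralImageDataset(Dataset):
    def __init__(self, paths):
        # store the argument (the original read self.paths before it was set)
        self.paths = np.array(paths)
        self.l = len(self.paths)

    def __len__(self):
        return self.l

    def __getitem__(self, idx):
        path = self.paths[idx]
        image = np.load(path)
        return image

dataset = MultiSpectralImageDataset(paths)
loader = DataLoader(dataset, batch_size=800, shuffle=True, pin_memory=True, num_workers=16, drop_last=True)
for i, X in enumerate(loader):
    # move the batch to the GPU and convert float16 to float32
    X = X.cuda(non_blocking=True).float()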
The images are individual files on a very fast NVMe SSD. I can verify the read speed of the SSD with sudo hdparm -tT /dev/nvme1n1. This gives me:
/dev/nvme1n1:
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
readonly = 0 (off)
readahead = 256 (on)
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
geometry = 1907729/64/32, sectors = 3907029168, start = 0
bla@bla:~/workspace$ sudo hdparm -tT /dev/nvme1n1
/dev/nvme1n1:
Timing cached reads: 59938 MB in 2.00 seconds = 30041.04 MB/sec
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
Timing buffered disk reads: 6308 MB in 3.00 seconds = 2102.35 MB/sec
This confirms that the read speed of the SSD is over 2 GB/s. However, when using the PyTorch DataLoader, I cannot come anywhere near this I/O speed. During training, the GPU is idle (0% utilization) most of the time, and the CPU is hardly used (htop shows most cores at 0% usage, some cores at 0.5-1.5% usage).
Running iotop shows that the Total Disk Read speed never surpasses 300 MB/s. If I decrease num_workers (say by half), the Total Disk Read remains the same (~200 MB/s), and each individual thread doubles in read speed. In particular, I observe that every num_workers iterations, one iteration is extremely slow (takes ~1 minute). This apparently simply means that the loading from disk is too slow, as discussed in the PyTorch forum here
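A quick back-of-the-envelope check of these numbers, as a sketch. It assumes each file holds one 12x256x256 float16 image (2 bytes per value) and that the MB/s figures mean 10^6 bytes per second; the helper names are made up for illustration.

```python
def batch_bytes(batch, channels, height, width, bytes_per_value=2):
    """Size in bytes of one batch of float16 images (2 bytes per value)."""
    return batch * channels * height * width * bytes_per_value


def seconds_to_read(n_bytes, mb_per_s):
    """Time to read n_bytes at a sustained rate given in MB/s (10^6 bytes/s)."""
    return n_bytes / (mb_per_s * 1e6)


one_batch = batch_bytes(800, 12, 256, 256)
all_workers = 16 * one_batch  # one batch in flight per DataLoader worker

print(one_batch / 2**20)                              # 1200.0 (MiB per batch)
print(round(seconds_to_read(all_workers, 300), 1))    # 67.1 s at the observed rate
print(round(seconds_to_read(all_workers, 2102.35), 1))  # 9.6 s at the hdparm rate
```

Under these assumptions, 16 workers each assembling one 800-image batch have to read about 20 GB, which takes roughly a minute at 300 MB/s. That would be consistent with the ~1 minute stall every num_workers iterations, although it does not explain why the read rate is capped.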
What's weird is that I am 99.9% confident it used to work. I remember consistently reaching almost 100% GPU utilization with the same data-loading procedure.
Things I've tried, but with no success:
Updating Ubuntu, updating everything with apt update & upgrade, rebooting, powering off and restarting
Updating the SSD firmware using fwupd (no updates available)
Giving higher priority to the process by running Python using sudo and using os.nice(-10)
Making space on the SSD (30% of the storage is empty, and I have run fstrim -v).
Using memmap, i.e. using the keyword in np.load(path, mmap_mode='r')
I really appreciate any help, as I've been stuck with this problem for weeks now, and what used to take 13 minutes per epoch now takes approximately 1h45 per epoch, making things infeasible to train.
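One way to narrow this down is to time raw file reads outside the DataLoader, so that disk throughput can be separated from np.load and worker overhead. The following is only a sketch using the standard library; read_file and measure_read_throughput are hypothetical helper names, and a thread pool is assumed to be an acceptable stand-in for the 16 worker processes. Files that were read recently may be served from the page cache, so the result can overstate what the SSD delivers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read one file completely and return its size in bytes."""
    with open(path, "rb") as f:
        return len(f.read())


def measure_read_throughput(paths, workers=16):
    """Read all files concurrently; return (total bytes, throughput in MB/s)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_bytes = sum(pool.map(read_file, paths))
    # guard against a zero interval on very small inputs
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_bytes, total_bytes / elapsed / 1e6
```

Calling measure_read_throughput(paths[:800]) with the file list from the question would give a figure to compare against both the iotop and the hdparm numbers.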

Related

Dask-Rapids data movement and out of memory issue

I am using dask (2021.3.0) and rapids (0.18) in my project. I perform a preprocessing task on the CPU, and the preprocessed data is later transferred to the GPU for K-means clustering. But in this process, I am getting the following problem:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(The error appears before the GPU memory is fully used, i.e. it is not using the GPU memory completely.)
I have a single GPU of size 40 GB.
Ram size 512 GB.
I am using following snippet of code:
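from dask.distributed import LocalCluster
import cupy as cp
# KMeans is presumably the RAPIDS (cuML) implementation; its import is not shown in the post
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)
# perform my preprocessing on the data and store the output in variable A
# convert variable A to CuPy
x = A.map_blocks(cp.asarray)
km = KMeans(n_clusters=4)
predict = km.fit_predict(x).compute()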
I am also looking for a solution so that the data larger than GPU memory can be preprocessed, and whenever there is a spill in GPU memory the spilled data is transferred into temp directory or CPU (as we do with dask where we define temp directory when there is a spill in RAM).
Any help will be appreciated.
There are several ways to run larger-than-GPU datasets.
Check out Nick Becker's blog, which has a few methods well documented.
Check out BlazingSQL, which is built on top of RAPIDS and can perform out-of-core processing. You can try it at beta.blazingsql.com.

aws gpu oom issue onnx cuda

I am running predictions on an AWS GPU instance, g4dn.4xlarge (16 GB GPU memory, 64 GB CPU memory), deployed with k8s and Docker.
Tested with (cuda10.1 + onnxruntime-gpu==1.4.0) and (cuda10.2 + onnxruntime-gpu==1.6.0); same error. The models are customised for our purpose, so I can't point to the weights.
The problem:
I am getting a CUDA OOM (out of memory) error:
Error: onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'Conv_16' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:298 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 33554432
On some backtracking:
Using nvidia-smi and GPU memory profiling, I found that from the first prediction onwards a constant amount of GPU memory stays blocked: at least ~1.8 GB for some models and ~3 GB for others (I think it is blocked for multiprocessing). Releasing the memory doesn't make sense, because the same amount will be blocked again for the next prediction.
My understanding:
At the peak, we scale up to 22 pods. In every pod the model load is initialized, so every pod blocks 1.8 to 3 GB of memory, and all pods point to one GPU instance with 16 GB of GPU memory. So, with 22 pods, OOM is expected.
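The arithmetic behind this expectation can be sketched as follows. The figures (16 GB per GPU, 1.8 to 3 GB per pod, 22 pods) come from the description above; treating 1 GB as 1000 MB and ignoring any additional per-process CUDA overhead are simplifying assumptions, and the function names are made up for illustration.

```python
GPU_MB = 16_000  # 16 GB GPU, counted as 1000 MB per GB (assumption)
PODS = 22        # peak pod count from the description


def pods_that_fit(gpu_mb, per_pod_mb):
    """How many pods fit if each one blocks per_pod_mb of GPU memory."""
    return gpu_mb // per_pod_mb


def total_demand_mb(pods, per_pod_mb):
    """GPU memory needed if every pod blocks per_pod_mb."""
    return pods * per_pod_mb


for per_pod_mb in (1_800, 3_000):
    print(per_pod_mb, pods_that_fit(GPU_MB, per_pod_mb), total_demand_mb(PODS, per_pod_mb))
# 1800 8 39600
# 3000 5 66000
```

So under these assumptions a single 16 GB GPU holds between 5 and 8 such pods, while 22 pods would ask for roughly 40 to 66 GB.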
What is confusing:
The CUDA message above reports OOM, but GPU profiling shows that memory utilisation is never more than 50%, although SM (streaming multiprocessor) utilisation is 100% at peak (when pods are scaled to 22). Image attached for reference.
From my research I understood that SM utilisation has nothing to do with OOM and that CUDA handles the SMs efficiently. Then why do I get a CUDA OOM error if only 50% of the memory is utilised?
What I ruled out:
I ruled out a memory leak from the model, as it runs without an OOM error when the load is low.
Why GPU and not CPU for prediction:
I want faster predictions. It ran on CPU without any error, even under high load.
What I am looking for:
A way to scale AWS GPU instances on the basis of GPU memory. If OOM is the reason, scaling on GPU memory should solve the problem, but I can't find how to do it.
An understanding of the CUDA message: when memory is available, why OOM?
Being very hypothetical: is there a way to create a singleton object, by design or using k8s, for a particular model load, so that scaled-up pods can use that loaded model for prediction rather than creating a new server? But that would defeat the purpose of using k8s for availability and scalability.

GluonCV - Object detection, set mx.ctx to GPU, but still using all CPU cores

I’m running an object detection routine on a server.
I set the context to the GPU, and I'm loading the model, the parameters and the data on the GPU. The program reads from a video file or from an RTSP stream, using OpenCV.
When using nvidia-smi, I see that the selected GPU usage is at 20%, which is reasonable. However, the object detection routine is still using 750-1200% of the CPU (basically, all of the available cores of the server).
This is the code:
import cv2
import gluoncv as gcv
import mxnet as mx

def main():
    ctx = mx.gpu(3)
    # -------------------------
    # Load a pretrained model
    # -------------------------
    net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_coco', pretrained=True)
    # Load the webcam handler
    cap = cv2.VideoCapture("video/video_01.mp4")
    count_frame = 0
    while True:
        print(f"Frame: {count_frame}")
        # Load frame from the camera
        ret, frame = cap.read()
        if (cv2.waitKey(25) & 0xFF == ord('q')) or not ret:
            cv2.destroyAllWindows()
            cap.release()
            print("Done!!!")
            break
        # Image pre-processing
        frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
        frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
        if isinstance(frame_nd, mx.ndarray.ndarray.NDArray):
            frame_nd.wait_to_read()
        # Run frame through network
        frame_nd = frame_nd.as_in_context(ctx)
        class_IDs, scores, bounding_boxes = net(frame_nd)
        # Block until the asynchronous results are actually computed
        if isinstance(class_IDs, mx.ndarray.ndarray.NDArray):
            class_IDs.wait_to_read()
        if isinstance(scores, mx.ndarray.ndarray.NDArray):
            scores.wait_to_read()
        if isinstance(bounding_boxes, mx.ndarray.ndarray.NDArray):
            bounding_boxes.wait_to_read()
        count_frame += 1
    cv2.destroyAllWindows()
    cap.release()
This is the output of nvidia-smi:
while this is the output of top:
The pre-processing operations are running on the CPU:
frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
but is that enough to justify such high CPU usage? If so, can I run them on the GPU as well?
EDIT: I modified and copied the whole code, in response to Olivier_Cruchant's comment (thanks!)
Your CPU is likely busy because of the pre-processing load and the frequent back-and-forth from memory to GPU, since inference seems to be running frame-by-frame.
I would suggest trying the following:
Run a batched inference (send a batch of N frames to the network) to increase GPU usage and reduce communication
Try using NVIDIA DALI to better use the GPU for data ingestion and pre-processing (DALI MXNet reference, DALI mp4 ingestion pytorch example)
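The first suggestion, grouping frames into batches of N before calling the network, could be sketched roughly as below. This shows only the grouping step in plain Python; batched is a hypothetical helper name, and the integers stand in for decoded video frames.

```python
from itertools import islice


def batched(frames, batch_size):
    """Yield lists of up to batch_size consecutive items from frames."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(frames)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Integers stand in for frames read from cv2.VideoCapture
for batch in batched(range(7), 3):
    print(batch)
# [0, 1, 2]
# [3, 4, 5]
# [6]
```

In the real pipeline each batch would then be pre-processed and stacked along the batch axis so that the network is called once per batch rather than once per frame; that step depends on the frames sharing one size and is not shown here.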

stop tensorflow and clear gram

There is something wrong with the fan of my GPU, so the GPU temperature gets too high after running TensorFlow for a while, and I can't finish my training before the GPU overheats. So I wrote a script that checks the temperature and tries to pause the program to let the GPU cool down. The code looks like this (the threshold is set to 45 for testing):
for batch in batches:
    temp = int(os.popen("nvidia-smi | awk '{if(NR == 12)print $3}' | cut -c 1,2").readline().strip())
    x_batch, y_batch, user_batch, item_batch = zip(*batch)
    train_step(x_batch, y_batch, user_batch, item_batch)
    current_step = tf.train.global_step(sess, self.global_step)
    if temp >= 45:
        path = saver.save(sess, checkpoint_prefix, global_step=current_step)
        print("temperature of GPU is over 45! Saved model checkpoint to {}\n".format(path))
        sess.close()
        return (-1, path, batches)
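As an aside, the temperature can also be read without depending on the line and column layout of the default nvidia-smi table, by using its query mode. This is a sketch, not the code used above: the two function names are made up, and the sample string passed to the parser is invented input rather than real output.

```python
import subprocess


def parse_gpu_temperatures(output):
    """Parse one integer temperature (degrees C) per non-empty line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_gpu_temperatures():
    """Ask nvidia-smi for the temperature of every GPU (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_temperatures(result.stdout)


# Invented sample text for two GPUs, only to show the parsing step
print(parse_gpu_temperatures("45\n62\n"))  # [45, 62]
```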
I wrap the TensorFlow code in one file and call it from another:
import gc
import time

result = 1000
restore = False
path = None
batches = None
while result != 1:
    result, path, batches = main(FLAGS, restore, path, batches)
    if result == -1:
        # training stopped because of the temperature: wait, then resume
        gc.collect()
        time.sleep(300)
        restore = True
Now the program pauses when the temperature is too high, but the GPU is still occupied and won't cool down. So I wonder how to stop TensorFlow and clear the GPU memory (VRAM).
The program paused when temperature is too high:
But the gpu is still occupied and can't cool down:
TensorFlow only releases all GPU memory after the program exits, which is why you see that the memory is not released. Still, I think pausing helps, since it stops your GPU from working at full speed (only 73 out of 149 W is being used, as shown in your figure); maybe pause longer if it doesn't cool down immediately.
In the end, this problem was solved by adding fans to cool down the GPU ...
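Since the GPU memory is only released when the process exits, one option is to run each stretch of training in a child process and let a parent process wait between runs. The sketch below makes several assumptions: the training script saves a checkpoint and exits with a dedicated exit code when the GPU is too hot, and OVERHEATED and run_with_cooldown are made-up names. The demo command is a trivial stand-in for the real training script.

```python
import subprocess
import sys
import time

OVERHEATED = 3  # hypothetical exit code meaning "stopped because the GPU is too hot"


def run_with_cooldown(command, cooldown_s=300, max_runs=10):
    """Rerun command until it exits with a code other than OVERHEATED.

    Returns (number of runs launched, last exit code).
    """
    code = OVERHEATED
    for run in range(1, max_runs + 1):
        # The child owns the GPU context; its memory is freed when it exits
        code = subprocess.run(command).returncode
        if code != OVERHEATED:
            return run, code
        if run < max_runs:
            time.sleep(cooldown_s)
    return max_runs, code


# Stand-in for the real training script: exits immediately with code 0
demo = [sys.executable, "-c", "import sys; sys.exit(0)"]
print(run_with_cooldown(demo, cooldown_s=0))  # (1, 0)
```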

Is there a way of determining how much GPU memory is in use by TensorFlow?

TensorFlow tends to preallocate the entire available memory on its GPUs. For debugging, is there a way of telling how much of that memory is actually in use?
(1) There is some limited support with Timeline for logging memory allocations. Here is an example for its usage:
import tensorflow as tf
from tensorflow.python.client import timeline

# Request a full trace for this step
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary, _ = sess.run([merged, train_step],
                      feed_dict=feed_dict(True),
                      options=run_options,
                      run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
train_writer.add_summary(summary, i)
print('Adding run metadata for', i)
# Convert the collected step stats into a Chrome trace, including memory
tl = timeline.Timeline(run_metadata.step_stats)
print(tl.generate_chrome_trace_format(show_memory=True))
trace_file = tf.gfile.Open(name='timeline', mode='w')
trace_file.write(tl.generate_chrome_trace_format(show_memory=True))
trace_file.close()
You can try this code with the MNIST example (mnist with summaries).
This generates a tracing file named timeline, which you can open with chrome://tracing. Note that this only gives approximate GPU memory usage statistics. It basically simulates a GPU execution, but doesn't have access to the full graph metadata. It also can't know how many variables have been assigned to the GPU.
(2) For a very coarse measure of GPU memory usage, nvidia-smi will show the total device memory usage at the time you run the command.
nvprof can show the on-chip shared memory usage and register usage at the CUDA kernel level, but doesn't show the global/device memory usage.
Here is an example command: nvprof --print-gpu-trace matrixMul
And more details here:
http://docs.nvidia.com/cuda/profiler-users-guide/#abstract
Here's a practical solution that worked well for me:
Disable GPU memory pre-allocation using TF session configuration:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
Run nvidia-smi -l (or some other utility) to monitor GPU memory consumption.
Step through your code with the debugger until you see the unexpected GPU memory consumption.
There's some code in tensorflow.contrib.memory_stats that will help with this:
import tensorflow as tf
from tensorflow.contrib.memory_stats.python.ops.memory_stats_ops import BytesInUse

with tf.device('/device:GPU:0'):  # Replace with device you are interested in
    bytes_in_use = BytesInUse()
with tf.Session() as sess:
    print(sess.run(bytes_in_use))
The TensorFlow profiler has an improved memory timeline that is based on real GPU memory allocator information:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/profiler#visualize-time-and-memory
tf.config.experimental.get_memory_info('GPU:0')
Currently returns the following keys:
'current': The current memory used by the device, in bytes.
'peak': The peak memory used by the device across the run of the program, in bytes.
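A small sketch of how the returned dictionary might be reported. format_memory_info is a hypothetical helper, the first call uses made-up numbers, and the TensorFlow call at the end only runs if TensorFlow (assumed to be a version that provides get_memory_info) and a GPU are present.

```python
def format_memory_info(info):
    """Render the 'current' and 'peak' byte counts as MiB."""
    mib = 2 ** 20
    return "current={:.1f} MiB, peak={:.1f} MiB".format(
        info["current"] / mib, info["peak"] / mib
    )


# Made-up values, only to show the output format
print(format_memory_info({"current": 512 * 2**20, "peak": 1536 * 2**20}))
# current=512.0 MiB, peak=1536.0 MiB

try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow not installed; skip the live query

if tf is not None and tf.config.list_physical_devices("GPU"):
    print(format_memory_info(tf.config.experimental.get_memory_info("GPU:0")))
```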
As @V.M previously mentioned, a solution that works well is using: tf.config.experimental.get_memory_info('DEVICE_NAME')
This function returns a dictionary with two keys:
'current': The current memory used by the device, in bytes
'peak': The peak memory used by the device across the run of the program, in bytes.
The values of these keys are the memory ACTUALLY used, not the allocated memory that nvidia-smi reports.
In practice, TensorFlow allocates all GPU memory by default, which makes nvidia-smi useless for checking how much memory your code actually uses. Even if tf.config.experimental.set_memory_growth is set to true, TensorFlow no longer allocates all available memory, but it still allocates more than it uses, and does so in discrete steps, e.g. 4589 MiB, then 8717 MiB, then 16943 MiB, then 30651 MiB, etc.
A small note concerning get_memory_info(): it does not return correct values when called inside a tf.function()-decorated function. The peak key should therefore be read after the tf.function()-decorated function has executed to determine the peak memory used.
For older versions of TensorFlow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function and only returned the used memory (no option for determining the peak memory).
Final note: you can also consider the TensorFlow Profiler available with TensorBoard, as @Peter mentioned.
Hope this helps :)