Loading HDF5 with resizable length is very slow compared with non-resizable - numpy

I have an original HDF5 file with a dataset of shape (3737, 224, 224, 3) that was not extendable, i.e. no maxshape argument was passed during its creation.
I decided to create a new HDF5 file and create the dataset with maxshape=(None, 224, 224, 3) so that I can resize it later. I then copied the dataset from the original HDF5 file to the new one and saved it.
The contents of the two HDF5 files are exactly the same. I then tried to read all the data back, and I found significant performance degradation for the resizable version.
Original:
CPU times: user 660 ms, sys: 2.58 s, total: 3.24 s
Wall time: 6.08 s
Resizable:
CPU times: user 18.6 s, sys: 4.41 s, total: 23 s
Wall time: 49.5 s
That's almost 10 times as slow. Is this to be expected? The file size difference is less than 2 MB. Are there optimization tips/tricks I need to be aware of?
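For context, the copy itself was done along these lines (a minimal sketch; the file and dataset names are hypothetical placeholders, not the ones from my script):

import h5py

with h5py.File("original.h5", "r") as src, h5py.File("resizable.h5", "w") as dst:
    data = src["images"][:]  # read the full (3737, 224, 224, 3) array into memory
    # maxshape=(None, ...) makes the first axis resizable, which forces chunked storage
    dst.create_dataset("images", data=data, maxshape=(None, 224, 224, 3))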

Upon reading the HDF5 docs carefully, it appears that specifying maxshape during the creation of a dataset (which enables resizing it later) also turns chunking on, and this seems to be mandatory. The "default" chunk size it picked for me is dataset.chunks = (234, 14, 28, 1).
According to the docs, this means the data are not stored contiguously but "haphazardly" in a B-tree-like structure. This most likely explains the slowness I observed: it is probably doing much more extra I/O than I thought.
I set the chunk size to the entire dataset size by passing chunks=(3737, 224, 224, 3), and this time I got:
CPU times: user 809 µs, sys: 837 ms, total: 838 ms
Wall time: 914 ms
That's a big speed-up in loading my (3737, 224, 224, 3) tensor. I sort of understand why chunking is a scalability solution, but the fact that it magically assigns a chunk size is confusing. My context is mini-batch training for deep learning, so the optimal thing would be for each chunk to be one mini-batch.
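In case it helps someone else, here is a minimal sketch of creating the dataset with the chunk shape aligned to a mini-batch, as described above (the batch size of 32 and the dummy data are placeholders):

import h5py
import numpy as np

BATCH = 32  # hypothetical mini-batch size; one chunk per batch read
data = np.zeros((3737, 224, 224, 3), dtype=np.uint8)  # stand-in for the real images

with h5py.File("resizable_chunked.h5", "w") as f:
    f.create_dataset(
        "images",
        data=data,
        maxshape=(None, 224, 224, 3),   # still resizable along the first axis
        chunks=(BATCH, 224, 224, 3),    # chunk shape matches one mini-batch of full images
    )

The general intuition seems to be to match the chunk shape to how you read the data: one whole-dataset chunk suits reading everything at once, while batch-sized chunks suit reading one mini-batch at a time.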

Related

Training with Roboflow-Train-YOLOv5 stops with a '^C'

Running Roboflow's notebook, 'Roboflow-Train-YOLOv5', stops after completing the epochs loop.
Instead of reporting the results, I get the following lines, with a ^C at the end of the third line from the end.
I would like to know the reason for the failure, and whether there is a way to fix it.
10 epochs completed in 0.191 hours.
Optimizer stripped from runs/train/yolov5s_results2/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/yolov5s_results2/weights/best.pt, 14.9MB
Validating runs/train/yolov5s_results2/weights/best.pt...
Fusing layers...
my_YOLOv5s summary: 213 layers, 7015519 parameters, 0 gradients, 15.8 GFLOPs
Class Images Labels P R mAP@.5 mAP@.5:.95: 20% 1/5 [00:01<00:04, 1.03s/it]^C
CPU times: user 7.01 s, sys: 830 ms, total: 7.84 s
Wall time: 12min 31s
My Colab plan is Colab pro, so I guess it is not a problem of resources.

I/O in Pytorch DataLoader with np.load extremely slow on SSD

I am trying to load a relatively large batch of float16 multispectral images (BxCxHxW=800x12x256x256) to train a deep learning model. The code for the DataLoader is extremely simple:
import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

data_dir = "/home/bla/data"
paths = [os.path.join(data_dir, p) for p in os.listdir(data_dir)]

class MultiSpectralImageDataset(Dataset):
    def __init__(self, paths):
        self.paths = np.array(paths)
        self.l = len(self.paths)

    def __len__(self):
        return self.l

    def __getitem__(self, idx):
        path = self.paths[idx]
        image = np.load(path)  # each file holds one float16 multispectral image
        return image

dataset = MultiSpectralImageDataset(paths)
loader = DataLoader(dataset, batch_size=800, shuffle=True, pin_memory=True, num_workers=16, drop_last=True)

for i, X in enumerate(loader):
    X = X.cuda(non_blocking=True).float()
The images are individual files on a very fast NVME SSD. I can verify the read speed of the SSD with sudo hdparm -tT /dev/nvme1n1. This gives me:
/dev/nvme1n1:
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
readonly = 0 (off)
readahead = 256 (on)
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
geometry = 1907729/64/32, sectors = 3907029168, start = 0
bla@bla:~/workspace$ sudo hdparm -tT /dev/nvme1n1
/dev/nvme1n1:
Timing cached reads: 59938 MB in 2.00 seconds = 30041.04 MB/sec
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
Timing buffered disk reads: 6308 MB in 3.00 seconds = 2102.35 MB/sec
This confirms the read speed of the SSD is over 2 GB/s. However, when using the PyTorch DataLoader, I am nowhere near able to match this I/O speed. During training, the GPU is idle (0% utilization) most of the time, and the CPU is hardly used (htop shows most cores at 0% usage, some cores at 0.5-1.5% usage). Running iotop shows:
The Total Disk Read speed never surpasses 300 MB/s. If I decrease num_workers (say by half), the Total Disk Read remains the same (~200 MB/s), and each individual thread doubles in read speed. In particular, I observe that every num_workers iterations, the iteration is extremely slow (it takes ~1 minute). This apparently simply means that the loading from disk is too slow, as discussed in the PyTorch forum here.
What's weird is that I am 99.9% confident it used to work. I remember consistently reaching almost 100% GPU utilization with the same data-loading procedure.
Things I've tried, but with no success:
Updating Ubuntu, updating everything with apt update & upgrade, rebooting, powering off and restarting
Updating the SSD firmware using fwupd (no updates available)
Giving higher priority to the process by running Python using sudo and using os.nice(-10)
Making space on the SSD (30% of the storage is empty); I have also run fstrim -v
Using memory mapping, i.e. np.load(path, mmap_mode='r')
I would really appreciate any help, as I've been stuck on this problem for weeks now, and what used to take 13 minutes per epoch now takes approximately 1h45 per epoch, making training infeasible.
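In case it's useful for debugging, here is a minimal sketch of what I'd run to separate raw np.load throughput from DataLoader overhead (the directory is the one from the snippet above; everything else is a placeholder):

import os
import time
import numpy as np

data_dir = "/home/bla/data"
paths = [os.path.join(data_dir, p) for p in os.listdir(data_dir)][:800]  # one batch worth of files

start = time.perf_counter()
total_bytes = 0
for p in paths:
    arr = np.load(p)          # plain single-threaded reads, no DataLoader involved
    total_bytes += arr.nbytes
elapsed = time.perf_counter() - start
print(f"{len(paths)} files, {total_bytes / elapsed / 1e6:.1f} MB/s")

If this loop already tops out far below the ~2 GB/s that hdparm reports, the bottleneck is in the file reads themselves rather than in the DataLoader workers.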

denoise image segmentation (mode filtering?) - or how to vectorize this operation in numpy?

UPDATE: In my initial post I mistakenly applied stats.mode patch-wise rather than along the axis of the patches. Fixing this increased my speed by a factor of 4. However, it's still slow and my original questions still stand: (1) can I increase the speed? (2) are there different/better/standard approaches to cleaning up noisy categorical data? Back to the post:
I have some image segmentation results that are noisy and I want to clean them up. My idea is to take the mode value over (3, 3) patches. This code works, but it's too slow:
import numpy as np
from sklearn.feature_extraction import image
from scipy import stats

def _mode(a, axis=None):
    m, _ = stats.mode(a, axis=axis)
    return m

def mode_smoothing(data, kernel=(3, 3)):
    patches = image.extract_patches_2d(data, kernel)
    nb_patches = patches.shape[0]
    patches = patches.reshape(nb_patches, -1)
    return _mode(patches, 1).reshape(int(np.sqrt(nb_patches)), -1)

""" original method (new version is ~5 times faster, but still slow)
def _mode(arr):
    m, _ = stats.mode(arr, axis=None)
    return m

def mode_smoothing(data, kernel=(3, 3)):
    patches = image.extract_patches_2d(data, kernel)
    nb_patches = patches.shape[0]
    w = int(np.sqrt(nb_patches))
    o = np.array([_mode(patches[p]) for p in range(nb_patches)])
    return o.reshape(w, -1)
"""
Questions:
Is there a way to do this that is much, much faster? Can I eliminate the for loop / vectorize it in numpy, or port it to C directly, or use numba, etc.? I struggled to get something working along these paths.
Are there better / more standard methods for accomplishing denoising like this on categorical image data?
Here is a before/after example from the mode_smoothing method above
Below I present two answers to my own question:
by expanding out my initial attempt into a function that numba can handle
using the above suggestion by Alex Alex, which I'll call "categorical smoothing" (is there a standard name for this method?)
I haven't written out a mathematical proof yet, but it appears this patch-wise mode smoothing is equivalent to the categorical smoothing for the correct choice of parameters. Both lead to a big speed boost, but the categorical-smoothing solution is cleaner, faster, and doesn't involve numba, so it wins.
NUMBA
import numpy as np
from numba import njit

@njit
def mode_smoothing(data, kernel=(3, 3), step=(1, 1), edges=False, high_value=False, center_boost=False):
    """ mode smoothing over patches
        Args:
            data<np.array>: numpy array (ndim=2)
            kernel<tuple[int]>: (height,width) of patch window
            step<tuple[int]>: (y-step,x-step)
            edges<bool>:
                - if true
                    * include edge patches by taking the mode over a smaller patch window
                    * the returned image will be the same shape as the input data
                - if false
                    * only run over patches with the full kernel size
                    * the returned image will be reduced in size by the radius of the kernel
            high_value<bool>:
                when there are multiple possible mode values choose the highest if true,
                otherwise choose the lowest value
            center_boost<int|bool>:
                if true, instead of using the pure mode value, increase the count on the center pixel
        Return
            <np.array> of patch-wise mode values. shape may differ from the input. see `edges` above
    """
    h, w = data.shape
    ry = int(kernel[0] // 2)
    rx = int(kernel[1] // 2)
    sy, sx = step
    _mode_vals = []
    if edges:
        j0, j1 = 0, h
        i0, i1 = 0, w
    else:
        j0, j1 = ry, h - ry
        i0, i1 = rx, w - rx
    for j in range(j0, j1, sy):
        for i in range(i0, i1, sx):
            # patch around (j, i), clipped at the image border
            ap = data[
                max(j - ry, 0):j + ry + 1,
                max(i - rx, 0):i + rx + 1]
            cv = data[j, i]
            values = np.unique(ap)
            count = 0
            for v in values:
                newcount = (ap == v).sum()
                if center_boost and (v == cv):
                    newcount += center_boost
                if high_value:
                    test = newcount >= count
                else:
                    test = newcount > count
                if test:
                    count = newcount
                    mode_value = v
            _mode_vals.append(mode_value)
    return np.array(_mode_vals).reshape(j1 - j0, i1 - i0)
CATEGORICAL SMOOTHING
import numpy as np
from scipy.signal import convolve2d

KERNEL = np.ones((3, 3))

def categorical_smoothing(data, nb_categories, kernel=KERNEL):
    # one-hot encode the labels: shape (nb_categories, H, W)
    data = np.eye(nb_categories)[:, data]
    # count each label in the neighbourhood defined by the kernel, then pick the winner per pixel
    for i in range(nb_categories):
        data[i] = convolve2d(data[i], kernel, mode='same')
    return data.argmax(axis=0)
EQUIVALENCE/SPEED-CHECK
This is probably easy to prove but...
S=512
N=19
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=False)
kernel=np.ones((3,3))
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
print()
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=1)
kernel=np.ones((3,3))
kernel[1,1]=2
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
""" OUTPUT
CPU times: user 826 ms, sys: 0 ns, total: 826 ms
Wall time: 825 ms
CPU times: user 416 ms, sys: 7.83 ms, total: 423 ms
Wall time: 423 ms
True
CPU times: user 825 ms, sys: 3.78 ms, total: 829 ms
Wall time: 828 ms
CPU times: user 422 ms, sys: 3.91 ms, total: 426 ms
Wall time: 425 ms
True
"""

Loading large set of images kill the process

Loading 1500 images of size (1000, 1000, 3) breaks the code and throws "kill 9" without any further error. Memory used before this line of code is 16% of total system memory. The total size of the images directory is 7.1 GB.
X = np.asarray(images).astype('float64')
y = np.asarray(labels).astype('float64')
System spec is:
OS: macOS Catalina
Processor: 2.2 GHz 6-Core Intel Core i7
Memory: 16 GB 2400 MHz DDR4
Update:
Getting the below error while running the code on 32 vCPUs and 120 GB of memory:
MemoryError: Unable to allocate 14.1 GiB for an array with shape (1200, 1024, 1024, 3) and data type float32
You would have to provide some more info/details for an exact answer, but assuming this is a memory error (which is incredibly likely): the size of the images on disk does not represent the size they occupy in memory, so that figure is largely irrelevant. Once decoded into arrays, the images will always occupy a lot more space than the compressed files, plus the overhead of the Python objects holding them. Intuitively I would say that 16 GB of RAM is nowhere near enough to load 7 GB of images. It's impossible to tell you exactly how much you would need, but from experience I would say you'd need to bump it up to 64 GB. If you are using Keras, I would suggest looking into the DirectoryIterator.
Edit:
As Cris Luengo pointed out, I missed the fact that you stated the size of the images.
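For concreteness, here is the back-of-the-envelope arithmetic behind the reported sizes (a small sketch using the shapes and dtypes stated above):

GIB = 1024 ** 3

# Original attempt: 1500 images of (1000, 1000, 3) cast to float64 (8 bytes per element)
print(1500 * 1000 * 1000 * 3 * 8 / GIB)   # ~33.5 GiB, far beyond 16 GB of RAM

# Later MemoryError: an array of shape (1200, 1024, 1024, 3) as float32 (4 bytes per element)
print(1200 * 1024 * 1024 * 3 * 4 / GIB)   # ~14.1 GiB, matching the reported error

This is why a generator-style loader such as the Keras DirectoryIterator mentioned above, which reads one batch at a time, is usually the way out.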

Avoiding exhausting GPU resources in convNN Tensorflow

I'm trying to run a hyperparameter optimization script for a convNN using Tensorflow.
As you may know, TF's handling of GPU memory isn't that fancy (and I don't think it ever will be, thanks to the TPU). So my question is: how do I choose the filter dimensions and the batch size so that the GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batchSize = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw_fh_fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understand, the tf.conv2d function will need the following amount of memory:
image_width * image_height * num_channels * batchSize * filter_height * filter_width * filter_depth * 32 bits
since we're using the tf.float32 type for each pixel.
In the given example, the needed memory will be:
128 x 128 x 3 x 20 x 4 x 4 x 32 x 32 = 16106127360 bits, which is almost 16 GB of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.
Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
EDIT
Here is a tensorboard screenshot of the net — it reports 40MB rather than 44MB, probably because it excludes the input — and you also have all the tensor sizes I mentioned earlier.
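For anyone who wants to reproduce the arithmetic, a minimal sketch of the same calculation in Python (tensor shapes exactly as listed in the answer):

# element counts of the tensors involved, all stored as float32 (4 bytes each)
input_elems  = 20 * 128 * 128 * 3    # input batch
kernel_elems = 4 * 4 * 3 * 32        # convolution weights
output_elems = 20 * 128 * 128 * 32   # output feature maps

total_mb = (input_elems + kernel_elems + output_elems) * 4 / 1024 ** 2
print(round(total_mb, 1))            # ~43.8 MB, i.e. the "about 44MB" quoted above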